* [PATCH 0/8] Reduce writeback from page reclaim context V4
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

Sorry for the long delay, I got side-tracked on other bugs.

This is a follow-on series from the series "Avoid overflowing of stack
during page reclaim". It eliminates writeback that requires calling into
the filesystem from direct reclaim and follows on by reducing the amount
of IO required from page reclaim to mitigate any corner cases arising
from that change.

Changelog since V3
  o Distinguish between file and anon related IO from page reclaim
  o Allow anon writeback from reclaim context
  o Sync old inodes first in background writeback
  o Pre-emptively clean pages when dirty pages are encountered on the LRU
  o Rebase to 2.6.35-rc5

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead
    of writing the pages directly, wake flusher threads to clean the number
    of dirty pages encountered
 
Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem with the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series begins by preventing
writeback from direct reclaim and allowing btrfs and xfs to write back from
kswapd context. As this is a potentially large change, the remainder of
the series aims to reduce any filesystem writeback from page reclaim and
depend more on background flush.

The first patch in the series is a roll-up of what should currently be
in mmotm. It's provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and
anon page writeback from page reclaim as they use the stack to different
depths. They update the trace points and scripts appropriately, noting which
mmotm patch they should be merged with.
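
As a rough illustration (not the literal patch; the helper and flag names
below are made up for this sketch), the writeback tracepoint ends up carrying
enough state to tell anon from file and sync from async IO:

	/* Sketch only: encode what kind of writeback the tracepoint saw */
	#define RECLAIM_WB_ANON		0x0001u
	#define RECLAIM_WB_FILE		0x0002u
	#define RECLAIM_WB_SYNC		0x0004u
	#define RECLAIM_WB_ASYNC	0x0008u

	static unsigned long reclaim_wb_flags(struct page *page, bool sync)
	{
		return (page_is_file_cache(page) ? RECLAIM_WB_FILE
						 : RECLAIM_WB_ANON) |
		       (sync ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC);
	}

The post-processing script then splits its writepage counters along the same
lines, which is where the "write file/anon sync/async I/O" rows in the tables
below come from.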

Patch 4 prevents direct reclaim writing out filesystem pages while still
allowing writeback of anon pages, which are in less danger of stack overflow
and do not have something like background flush to clean them.
For filesystem pages, flusher threads are asked to clean the number of
pages encountered, the caller waits on congestion and puts the pages back
on the LRU.  For lumpy reclaim, the caller will wait for a time, calling the
flusher multiple times and waiting on dirty pages to be written out before
trying to reclaim the dirty pages a second time. This increases the
responsibility of kswapd somewhat because it is now cleaning pages on behalf
of direct reclaimers but, unlike background flushers, kswapd knows which
zone pages need to be cleaned from. As it is async IO, it should not cause
kswapd to stall (at least until the queue is congested) but the order in
which pages are reclaimed from the LRU is altered. Dirty pages that would
have been reclaimed by direct reclaimers get another lap on the LRU. The
dirty pages could have been put on a dedicated list but this increased
counter overhead and the number of lists, and it is unclear if it is
necessary.
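
As a minimal sketch of the idea (not the actual diff; may_write_to_fs() and
nr_dirty_skipped are illustrative names, though current_is_kswapd(),
page_is_file_cache(), wakeup_flusher_threads() and congestion_wait() are
existing kernel helpers of this era):

	/* Sketch: may this reclaim context call ->writepage for the page? */
	static bool may_write_to_fs(struct page *page)
	{
		if (!page_is_file_cache(page))
			return true;		/* anon/swap writeback still allowed */
		return current_is_kswapd();	/* file pages: only from kswapd */
	}

	/*
	 * Sketch, in the caller: direct reclaimers that skipped dirty file
	 * pages instead ask the flusher threads to clean roughly that many
	 * pages, then throttle briefly before the pages get another lap on
	 * the LRU.
	 */
	if (!current_is_kswapd() && nr_dirty_skipped) {
		wakeup_flusher_threads(nr_dirty_skipped);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	}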

Patches 5 and 6 revert changes in XFS and btrfs that ignore writeback from
reclaim context, which is a relatively recent change. extX could be modified
to allow kswapd to write back, but it is a relatively deep change. There may
be some collision with items in the filesystem git trees but it is expected
to be trivial to resolve.
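
The guards being reverted are roughly of the following shape (paraphrased
from the ->writepage implementations of that era, not quoted from the
patches):

	/* Before: refuse any writeback attempted from reclaim context */
	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

With patch 4 keeping direct reclaim away from ->writepage for file pages,
this check can be removed (or narrowed to non-kswapd callers) so that kswapd
is allowed to clean pages again.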

Patch 7 makes background flush behave more like kupdate by syncing old or
expired inodes first, as implemented by Wu Fengguang. As filesystem pages are
added onto the inactive queue and only promoted if referenced, it makes sense
to write old pages first to reduce the chances of page reclaim initiating IO.
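
Roughly, the change in wb_writeback() extends the kupdate-style expiry check
to background writeback as well (a sketch, not the exact patch; field names
may differ slightly):

	if (work->for_kupdate || work->for_background) {
		/* Only sync inodes dirtied longer ago than the expire interval */
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		wbc.older_than_this = &oldest_jif;
	}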

Patch 8 notes that dirty pages can still be found at the end of the LRU.
If a number of them are encountered, it is reasonable to assume that a
similar number of dirty pages will be discovered in the very near future as
that was the dirtying pattern at the time. The patch pre-emptively kicks the
background flusher to clean a number of pages, creating feedback from page
reclaim to the background flusher that is based on scanning rates. Christoph
has described discussions on this patch as a "band-aid" but Rik liked the
idea and the patch does have interesting results, so it is worth a closer look.
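
In outline (again a sketch; nr_dirty_seen is an illustrative name), the dirty
pages encountered during the LRU scan feed a pre-emptive wakeup:

	/* In shrink_page_list(): note dirty file pages as they are met */
	if (PageDirty(page) && page_is_file_cache(page))
		nr_dirty_seen++;

	/* After the batch: kick background flush to clean a similar number
	 * of pages so writeback keeps pace with the LRU scanning rate. */
	if (nr_dirty_seen)
		wakeup_flusher_threads(nr_dirty_seen);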

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each machine
had 3G of RAM and the CPUs were

X86:	Intel P4 2 core
X86-64:	AMD Phenom 4-core
PPC64:	PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. Tests on an earlier series indicated that moving to 40 did not make
much difference. The filesystem used for all tests was XFS.

Four kernels are compared.

traceonly-v4r7		is the first 3 patches of this series
nodirect-v4r7		is the first 6 patches
flusholdest-v4r7	makes background flush behave like kupdated (patch 1-7)
flushforward-v4r7	pre-emptively cleans pages when encountered on the LRU (patch 1-8)

The results of each test are broken up into two parts.  The first part is a
report based on the ftrace postprocessing script in patch 4 and reports on
direct reclaim and kswapd activity. The second part reports what percentage
of time was spent in direct reclaim and with kswapd awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script.  The total time is (User + Sys + Stall)
and the percentage is the stalled time over the total time.
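
For example (hypothetical figures, not taken from the tables below): with
User + Sys of 1200 seconds and 0.6 seconds of stall, the total is 1200.6
seconds and the direct reclaim percentage is 0.6 / 1200.6, or roughly 0.05%.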

I am omitting the actual performance results simply because they are not
interesting, with very few significant changes.

kernbench
=========

No writeback from reclaim initiated and no performance change of significance.

IOzone
======

No writeback from reclaim initiated and no performance change of significance.


SysBench
========

The results were based on a read/write workload and, as the machines are
under-provisioned for this type of test, the figures were very unstable with
variances up to 15%, so they are not reported. Part of the problem is that
larger thread counts push the test into swap as the memory is insufficient,
destabilising results further. I could tune for this, but it was reclaim
that was important.

X86
                                 traceonly-v4r7 nodirect-v4r7 flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                 18         25          6        196 
Direct reclaim pages scanned                  1615       1662        605      22233 
Direct reclaim write file async I/O             40          0          0          0 
Direct reclaim write anon async I/O              0          0         13          9 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        171039     401450     313156      90960 
Kswapd wakeups                                 685        532        611        262 
Kswapd pages scanned                      14272338   12209663   13799001    5230124 
Kswapd reclaim write file async I/O         581811      23047      23795        759 
Kswapd reclaim write anon async I/O         189590     124947     114948      42906 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)              0.00       0.91       0.92       1.31 
Time kswapd awake (ms)                     1079.32    1039.42    1194.82    1091.06 

User/Sys Time Running Test (seconds)       1312.24   1241.37   1308.16   1253.15
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               8411.28   7471.15   8292.18   8170.16
Percentage Time kswapd Awake                 3.45%     0.00%     0.00%     0.00%

Dirty file pages from X86 were not much of a problem to begin with and the
patches eliminate them as expected. What is interesting is that nodirect-v4r7
made such a large difference to the number of filesystem pages that had
to be written back. Apparently, background flush must have been doing a
better job of getting them cleaned in time, and the direct reclaim stalls
were harmful overall. Waking background threads for dirty pages made a very
large difference to the number of pages written back. With all patches
applied, just 759 filesystem pages were written back in comparison to 581811
in the vanilla kernel, and overall the number of pages scanned was reduced.

X86-64
                                 traceonly-v4r7 nodirect-v4r7 flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                795       1662       2131       6459 
Direct reclaim pages scanned                204900     127300     291647     317035 
Direct reclaim write file async I/O          53763          0          0          0 
Direct reclaim write anon async I/O           1256        730       6114         20 
Direct reclaim write file sync I/O              10          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        690850    1457411    1713379    1648469 
Kswapd wakeups                                1683       1353       1275       1171 
Kswapd pages scanned                      17976327   15711169   16501926   12634291 
Kswapd reclaim write file async I/O         818222      26560      42081       6311 
Kswapd reclaim write anon async I/O         245442     218708     209703     205254 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)             13.50      41.19      69.56      51.32 
Time kswapd awake (ms)                     2243.53    2515.34    2767.58    2607.94 

User/Sys Time Running Test (seconds)        687.69    650.83    653.28    640.38
Percentage Time Spent Direct Reclaim         0.01%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               6954.05   6472.68   6508.28   6211.11
Percentage Time kswapd Awake                 0.04%     0.00%     0.00%     0.00%

Direct reclaim of filesystem pages is eliminated as expected. Again, the
overall number of pages that need to be written back by page reclaim is
reduced. Flushing just the oldest inodes was not much of a help in terms
of how many pages needed to be written back from reclaim, but pre-emptively
waking flusher threads helped a lot.

Oddly, more time was spent in direct reclaim with the patches as a greater
number of anon pages needed to be written back. It's possible this was
due to the test making more forward progress as indicated by the shorter
running time.

PPC64
                                 traceonly-v4r7 nodirect-v4r7 flusholdest-v4r7 flushforward-v4r7
Direct reclaims                               1517      34527      32365      51973 
Direct reclaim pages scanned                144496    2041199    1950282    3137493 
Direct reclaim write file async I/O          28147          0          0          0 
Direct reclaim write anon async I/O            463      25258      10894          0 
Direct reclaim write file sync I/O               7          0          0          0 
Direct reclaim write anon sync I/O               0          1          0          0 
Wake kswapd requests                       1126060    6578275    6281512    6649558 
Kswapd wakeups                                 591        262        229        247 
Kswapd pages scanned                      16522849   12277885   11076027    7614475 
Kswapd reclaim write file async I/O        1302640      50301      43308       8658 
Kswapd reclaim write anon async I/O         150876     146600     159229     134919 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)             32.28     481.52     535.15     342.97 
Time kswapd awake (ms)                     1694.00    4789.76    4426.42    4309.49 

User/Sys Time Running Test (seconds)       1294.96    1264.5   1254.92   1216.92
Percentage Time Spent Direct Reclaim         0.03%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               8876.80   8446.49   7644.95   7519.83
Percentage Time kswapd Awake                 0.05%     0.00%     0.00%     0.00%

Direct reclaim filesystem writes are eliminated but the scan rates went way
up. It implies that direct reclaim was spinning quite a bit and finding
clean pages, allowing the test to complete 22 minutes faster. Flushing the
oldest inodes helped, but pre-emptively waking background flushers helped
more in terms of the number of pages cleaned by page reclaim.

Stress HighAlloc
================

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated under
load (Pass 1), a second attempt under load (Pass 2) and when the kernel
compiles have finished and the system is quiet (At Rest). The patches have
little impact on the success rates.

X86
                                 traceonly-v4r7 nodirect-v4r7 flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                623        607        611        491 
Direct reclaim pages scanned                126515     117477     142502      91649 
Direct reclaim write file async I/O            896          0          0          0 
Direct reclaim write anon async I/O          35286      27508      35688      24819 
Direct reclaim write file sync I/O             580          0          0          0 
Direct reclaim write anon sync I/O           13932      12301      15203      11509 
Wake kswapd requests                          1561       1650       1618       1152 
Kswapd wakeups                                 183        209        211         79 
Kswapd pages scanned                       9391908    9144543   11418802    6959545 
Kswapd reclaim write file async I/O          92730       7073       8215        807 
Kswapd reclaim write anon async I/O         946499     831573    1164240     833063 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)           4653.17    4193.28    5292.97    6954.96 
Time kswapd awake (ms)                     4618.67    3787.74    4856.45   55704.90 

User/Sys Time Running Test (seconds)       2103.48   2161.14      2131   2160.01
Percentage Time Spent Direct Reclaim         0.33%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               6996.43   6405.43   7584.74   8904.53
Percentage Time kswapd Awake                 0.80%     0.00%     0.00%     0.00%

Unfortunately, total time running the test increased, but this was the
only instance where it occurred. Otherwise it is a similar story to
elsewhere - filesystem direct writes are eliminated and overall filesystem
writes from page reclaim are significantly reduced to almost negligible
levels (0.01% of pages scanned by kswapd resulted in a filesystem write for
the full series in comparison to 0.99% in the vanilla kernel).

X86-64
                traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                               1275       1300       1222       1224 
Direct reclaim pages scanned                156940     152253     148993     148726 
Direct reclaim write file async I/O           2472          0          0          0 
Direct reclaim write anon async I/O          29281      26887      28073      26283 
Direct reclaim write file sync I/O            1943          0          0          0 
Direct reclaim write anon sync I/O           11777       9258      10256       8510 
Wake kswapd requests                          4865      12895       1185       1176 
Kswapd wakeups                                 869        757        789        822 
Kswapd pages scanned                      41664053   30419872   29602438   42603986 
Kswapd reclaim write file async I/O         550544      16092      12775       4414 
Kswapd reclaim write anon async I/O        2409931    1964446    1779486    1667076 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)           8908.93    7920.53    6192.17    5926.47 
Time kswapd awake (ms)                     6045.11    5486.48    3945.35    3367.01 

User/Sys Time Running Test (seconds)       2813.44   2818.17    2801.8   2803.61
Percentage Time Spent Direct Reclaim         0.21%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)              11217.45  10286.90   8534.22   8332.84
Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%

Unlike X86, total time spent on the test was significantly reduced and, as
elsewhere, filesystem IO due to reclaim is way down.

PPC64
                traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                665        709        652        663 
Direct reclaim pages scanned                145630     125161     116556     124718 
Direct reclaim write file async I/O            946          0          0          0 
Direct reclaim write anon async I/O          26983      23160      28531      23360 
Direct reclaim write file sync I/O             596          0          0          0 
Direct reclaim write anon sync I/O           17517      13635      16114      13121 
Wake kswapd requests                           271        302        299        278 
Kswapd wakeups                                 181        164        158        172 
Kswapd pages scanned                      68789711   68058349   54613548   64905996 
Kswapd reclaim write file async I/O         159196      20569      17538       2475 
Kswapd reclaim write anon async I/O        2311178    1962398    1811115    1829023 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)          13784.95   12895.39   11132.26   11785.26 
Time kswapd awake (ms)                    13331.51   12603.74   10956.18   11479.22 

User/Sys Time Running Test (seconds)       3567.03   2730.23   2682.86   2668.08
Percentage Time Spent Direct Reclaim         0.33%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)              15282.74  14347.67  12614.61  13386.85
Percentage Time kswapd Awake                 0.08%     0.00%     0.00%     0.00%

Similar story, the test completed faster and page reclaim IO is down.

Overall, the patches seem to help. Reclaim activity is reduced while test
times are generally improved. A big concern with V3 was that direct reclaim
not being able to write pages could lead to unexpected behaviour. This
series mitigates that risk by reducing the amount of IO initiated by page
reclaim, making it a rarer event.

Mel Gorman (7):
  MMOTM MARKER
  vmscan: tracing: Update trace event to track if page reclaim IO is
    for anon or file pages
  vmscan: tracing: Update post-processing script to distinguish between
    anon and file IO from page reclaim
  vmscan: Do not writeback filesystem pages in direct reclaim
  fs,btrfs: Allow kswapd to writeback pages
  fs,xfs: Allow kswapd to writeback pages
  vmscan: Kick flusher threads to clean pages when reclaim is
    encountering dirty pages

Wu Fengguang (1):
  writeback: sync old inodes first in background writeback

 .../trace/postprocess/trace-vmscan-postprocess.pl  |   89 +++++++++-----
 Makefile                                           |    2 +-
 fs/btrfs/disk-io.c                                 |   21 +----
 fs/btrfs/inode.c                                   |    6 -
 fs/fs-writeback.c                                  |   19 +++-
 fs/xfs/linux-2.6/xfs_aops.c                        |   15 ---
 include/trace/events/vmscan.h                      |    8 +-
 mm/vmscan.c                                        |  121 ++++++++++++++++++-
 8 files changed, 195 insertions(+), 86 deletions(-)



* [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

This is a roll-up of patches currently in an unreleased mmotm tree related to
stack reduction and tracing reclaim. The patches were taken from mm-commits
traffic. It is based on 2.6.35-rc5 and included for the convenience of
testing.

No signed off required.

--- 
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  414 +++++++------
 mm/vmstat.c                                        |    2 -
 10 files changed, 1089 insertions(+), 293 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..d1ddc33
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regex
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name length
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f seconds\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f seconds\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		$wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		$rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we will get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f2da66a
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,184 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20a8193..31abd1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..e6ddba9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -290,13 +293,13 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
  * prevents it from being freed up.  But we have a ref on the page and once
  * that page is locked, the mapping is pinned.
  *
- * We're allowed to run sleeping lock_page() here because we know the caller has
- * __GFP_FS.
+ * We're allowed to run sleeping lock_page_nosync() here because we know the
+ * caller has __GFP_FS.
  */
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
@@ -396,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -615,6 +620,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -623,13 +646,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -804,10 +826,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not, as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -830,9 +854,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -914,6 +939,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -991,12 +1017,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
@@ -1033,7 +1072,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1110,174 +1150,177 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
- * of reclaimed pages
+ * TODO: Try merging with migrations version of putback_lru_pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
 {
-	LIST_HEAD(page_list);
+	struct page *page;
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	pagevec_init(&pvec, 1);
 
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
 	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
 
-	pagevec_init(&pvec, 1);
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
 
-		if (nr_taken == 0)
-			goto done;
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+/*
+ * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * of reclaimed pages
+ */
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	while (unlikely(too_many_isolated(zone, file, sc))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+		/* We are about to die and free our memory. Return now. */
+		if (fatal_signal_pending(current))
+			return SWAP_CLUSTER_MAX;
+	}
 
-		spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	lru_add_drain();
+	spin_lock_irq(&zone->lru_lock);
 
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
-		nr_reclaimed += nr_freed;
+	spin_unlock_irq(&zone->lru_lock);
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
+			sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		spin_lock(&zone->lru_lock);
 		/*
-		 * Put back any unfreeable pages.
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
-		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_active = clear_active_flags(&page_list, NULL);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-  	} while (nr_scanned < max_scan);
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
-	return nr_reclaimed;
-}
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-/*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+	return nr_reclaimed;
 }
 
 /*
@@ -1727,13 +1770,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1743,17 +1785,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1788,7 +1821,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1800,7 +1832,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist,
+				gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
@@ -1859,17 +1892,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -1886,6 +1908,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1898,7 +1921,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2026,22 +2057,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2109,9 +2130,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2184,16 +2203,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2297,9 +2306,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2319,8 +2329,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2352,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
@@ -2609,7 +2622,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');


* [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm
@ 2010-07-19 13:11   ` Mel Gorman
  0 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

This is a roll-up of patches currently in an unreleased mmotm tree related to
stack reduction and tracing reclaim. The patches were taken from mm-commits
traffic. It is based on 2.6.35-rc5 and included for the convenience of
testing.

No Signed-off-by is required.
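
As a quick sanity check when testing, the script's trace-line parsing can be
exercised on its own. This is only a minimal sketch: the regex is the one
trace-vmscan-postprocess.pl uses, but the sample trace_pipe line (process
name, cpu, timestamp and event fields) is made up for illustration and the
exact layout may differ between kernels.

  #!/usr/bin/perl
  # Split an (assumed) trace_pipe line into the five fields the
  # postprocessing script works with: process-pid, cpu, timestamp,
  # tracepoint name and the per-event details string.
  use strict;

  my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
  my $sample = "     kswapd0-44    [001]   123.456789: mm_vmscan_kswapd_wake: nid=0 order=2";

  if ($sample =~ /$regex_traceevent/o) {
          print "process_pid=$1 cpu=$2 time=$3 tracepoint=$4 details=$5\n";
  }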

--- 
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  414 +++++++------
 mm/vmstat.c                                        |    2 -
 10 files changed, 1089 insertions(+), 293 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..d1ddc33
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regex
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert a sec.usec timestamp to milliseconds
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		$wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		$rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
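
(Illustration, not part of the patch.) The script above records each reclaim
stall as an "order-latency" string in HIGH_DIRECT_RECLAIM_LATENCY (and
HIGH_KSWAPD_LATENCY) and splits it again in dump_stats(). A simplified,
standalone Perl sketch of that encoding, using invented timestamps:

	# Illustrative only -- mirrors the encoding used by the script above
	my %stats;
	my $begin = 12.50;	# ms, taken at mm_vmscan_direct_reclaim_begin
	my $end   = 15.75;	# ms, taken at mm_vmscan_direct_reclaim_end
	my $order = 3;

	# One "order-latency" entry per begin/end pair
	push @{$stats{HIGH_DIRECT_RECLAIM_LATENCY}}, $order . "-" . ($end - $begin);

	# Decoded at report time, just as dump_stats() does
	foreach my $entry (@{$stats{HIGH_DIRECT_RECLAIM_LATENCY}}) {
		my ($order, $latency) = split(/-/, $entry);
		printf "order=%d stalled=%.2f ms\n", $order, $latency;
	}
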
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we will get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
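
(Illustration, not part of the patch.) To make the ordering rule in the
comment above concrete: __print_flags() walks the table in order and reports
every mask whose bits are all set, so composite masks with the most bits must
come first or the more specific name is never reached. A small Perl sketch of
that subset matching; the hex values are the 2.6.35-era GFP bits and are
assumptions for the example only:

	# Assumed flag values, for illustration only (see include/linux/gfp.h)
	my $GFP_WAIT     = 0x10;
	my $GFP_IO       = 0x40;
	my $GFP_FS       = 0x80;
	my $GFP_HARDWALL = 0x20000;

	my $GFP_KERNEL = $GFP_WAIT | $GFP_IO | $GFP_FS;		# 0xd0
	my $GFP_USER   = $GFP_KERNEL | $GFP_HARDWALL;		# 0x200d0

	# Greedy subset matching in table order, roughly what __print_flags() does
	sub show_flags {
		my ($flags, @table) = @_;
		my @names;
		foreach my $entry (@table) {
			my ($mask, $name) = @$entry;
			if (($flags & $mask) == $mask) {
				push @names, $name;
				$flags &= ~$mask;
			}
		}
		return join("|", @names);
	}

	# With GFP_KERNEL listed before GFP_USER, a GFP_USER allocation is
	# reported as "GFP_KERNEL|GFP_HARDWALL", exactly as the comment warns.
	print show_flags($GFP_USER,
		[ $GFP_KERNEL,   "GFP_KERNEL"   ],
		[ $GFP_USER,     "GFP_USER"     ],
		[ $GFP_HARDWALL, "GFP_HARDWALL" ]), "\n";
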
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f2da66a
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,184 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
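
(Illustration, not part of the patch.) Once the CREATE_TRACE_POINTS hunk in
mm/vmscan.c below is applied, these events show up under the ftrace events
directory and can be fed to the post-processing script. A minimal sketch,
assuming debugfs is mounted at the usual /sys/kernel/debug location:

	# Illustrative only: enable the vmscan events and stream the trace
	my $tracing = "/sys/kernel/debug/tracing";

	open(my $enable, '>', "$tracing/events/vmscan/enable") or die "enable: $!";
	print $enable "1\n";
	close($enable);

	# trace_pipe blocks until events arrive; pipe this into the
	# trace-vmscan-postprocess.pl script or a similar consumer
	open(my $trace, '<', "$tracing/trace_pipe") or die "trace_pipe: $!";
	print while <$trace>;
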
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20a8193..31abd1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..e6ddba9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -290,13 +293,13 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
  * prevents it from being freed up.  But we have a ref on the page and once
  * that page is locked, the mapping is pinned.
  *
- * We're allowed to run sleeping lock_page() here because we know the caller has
- * __GFP_FS.
+ * We're allowed to run sleeping lock_page_nosync() here because we know the
+ * caller has __GFP_FS.
  */
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
@@ -396,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -615,6 +620,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -623,13 +646,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -804,10 +826,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there a need to periodically call free_page_list()? It
+		 * would appear not, as the counts should be low.
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -830,9 +854,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -914,6 +939,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -991,12 +1017,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
@@ -1033,7 +1072,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1110,174 +1150,177 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
- * of reclaimed pages
+ * TODO: Try merging with migrations version of putback_lru_pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
 {
-	LIST_HEAD(page_list);
+	struct page *page;
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	pagevec_init(&pvec, 1);
 
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
 	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
 
-	pagevec_init(&pvec, 1);
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
 
-		if (nr_taken == 0)
-			goto done;
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+/*
+ * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * of reclaimed pages
+ */
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	while (unlikely(too_many_isolated(zone, file, sc))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+		/* We are about to die and free our memory. Return now. */
+		if (fatal_signal_pending(current))
+			return SWAP_CLUSTER_MAX;
+	}
 
-		spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	lru_add_drain();
+	spin_lock_irq(&zone->lru_lock);
 
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
-		nr_reclaimed += nr_freed;
+	spin_unlock_irq(&zone->lru_lock);
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
+			sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		spin_lock(&zone->lru_lock);
 		/*
-		 * Put back any unfreeable pages.
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
-		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_active = clear_active_flags(&page_list, NULL);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-  	} while (nr_scanned < max_scan);
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
-	return nr_reclaimed;
-}
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-/*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+	return nr_reclaimed;
 }
 
 /*
@@ -1727,13 +1770,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1743,17 +1785,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1788,7 +1821,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1800,7 +1832,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist,
+				gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
@@ -1859,17 +1892,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -1886,6 +1908,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1898,7 +1921,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2026,22 +2057,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2109,9 +2130,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2184,16 +2203,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2297,9 +2306,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2319,8 +2329,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2352,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
@@ -2609,7 +2622,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

It is useful to distinguish between IO for anon and file pages. This
patch updates
vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include
that information. The patches can be merged together.
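
As an illustration (not part of the patch), the updated event emits lines
following the TP_printk format in the diff below, i.e. "... file=<0|1>
sync_io=<0|1>", so anon and file writeback from reclaim can be tallied
separately. A minimal sketch, assuming raw ftrace output on stdin:

	# Sketch only: split mm_vmscan_writepage events into file vs anon
	my ($file_io, $anon_io) = (0, 0);
	while (<STDIN>) {
		next unless /mm_vmscan_writepage:.*file=([0-9]+)/;
		$1 ? $file_io++ : $anon_io++;
	}
	print "file-backed writepage events: $file_io\n";
	print "anonymous writepage events:   $anon_io\n";
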

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |    8 ++++++--
 mm/vmscan.c                   |    1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f2da66a..110aea2 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -158,23 +158,27 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 TRACE_EVENT(mm_vmscan_writepage,
 
 	TP_PROTO(struct page *page,
+		int file,
 		int sync_io),
 
-	TP_ARGS(page, sync_io),
+	TP_ARGS(page, file, sync_io),
 
 	TP_STRUCT__entry(
 		__field(struct page *, page)
+		__field(int, file)
 		__field(int, sync_io)
 	),
 
 	TP_fast_assign(
 		__entry->page = page;
+		__entry->file = file;
 		__entry->sync_io = sync_io;
 	),
 
-	TP_printk("page=%p pfn=%lu sync_io=%d",
+	TP_printk("page=%p pfn=%lu file=%d sync_io=%d",
 		__entry->page,
 		page_to_pfn(__entry->page),
+		__entry->file,
 		__entry->sync_io)
 );
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e6ddba9..6587155 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -400,6 +400,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
+			page_is_file_cache(page),
 			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

It is useful to distinguish between IO for anon and file pages. This patch
updates
vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch
so the post-processing script can handle the additional information.
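
As an aside (not part of the patch), the effect on the script's accounting
can be summarised as: the writepage regex gains a "file" field and every
event now lands in one of four buckets instead of two. A minimal sketch of
that classification, mirroring the hunks below:

	# Sketch only; bucket names shadow the new MM_VMSCAN_WRITEPAGE_* constants
	my $regex_writepage = 'page=([0-9a-f]*) pfn=([0-9]*) file=([0-9]) sync_io=([0-9]*)';

	sub classify_writepage {
		my ($details) = @_;
		return "unparsed" unless $details =~ /$regex_writepage/;
		my ($file, $sync_io) = ($3, $4);
		return $sync_io
			? ($file ? "WRITEPAGE_FILE_SYNC"  : "WRITEPAGE_ANON_SYNC")
			: ($file ? "WRITEPAGE_FILE_ASYNC" : "WRITEPAGE_ANON_ASYNC");
	}
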

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |   89 +++++++++++++-------
 1 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index d1ddc33..7795a9b 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -21,9 +21,12 @@ use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
 use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
 use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
 use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
-use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
-use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
-use constant EVENT_UNKNOWN			=> 10;
+use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC	=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC	=> 9;
+use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC	=> 10;
+use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC	=> 11;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 12;
+use constant EVENT_UNKNOWN			=> 13;
 
 # Per-order events
 use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
@@ -55,9 +58,11 @@ my $opt_read_procstat;
 my $total_wakeup_kswapd;
 my ($total_direct_reclaim, $total_direct_nr_scanned);
 my ($total_direct_latency, $total_kswapd_latency);
-my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
+my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
 my ($total_kswapd_nr_scanned, $total_kswapd_wake);
-my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
+my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
 
 # Catch sigint and exit on request
 my $sigint_report = 0;
@@ -101,7 +106,7 @@ my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
 my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
 my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
 my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
-my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) file=([0-9]) sync_io=([0-9]*)';
 
 # Dyanically discovered regex
 my $regex_direct_begin;
@@ -209,7 +214,7 @@ $regex_lru_shrink_active = generate_traceevent_regex(
 $regex_writepage = generate_traceevent_regex(
 			"vmscan/mm_vmscan_writepage",
 			$regex_writepage_default,
-			"page", "pfn", "sync_io");
+			"page", "pfn", "file", "sync_io");
 
 sub read_statline($) {
 	my $pid = $_[0];
@@ -379,11 +384,20 @@ EVENT_PROCESS:
 				next;
 			}
 
-			my $sync_io = $3;
+			my $file = $3;
+			my $sync_io = $4;
 			if ($sync_io) {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
+				}
 			} else {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
+				}
 			}
 		} else {
 			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
@@ -427,7 +441,7 @@ sub dump_stats {
 		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
 			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
 
-			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
 				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
 				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
 				$total_direct_latency += $latency;
@@ -454,8 +468,11 @@ sub dump_stats {
 		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
 		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+
+		$total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		my $index = 0;
 		my $this_reclaim_delay = 0;
@@ -470,8 +487,8 @@ sub dump_stats {
 			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
 			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
 			$this_reclaim_delay / 1000);
 
 		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -515,16 +532,18 @@ sub dump_stats {
 
 		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
 		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
 			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
 
 		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
 			print "      ";
@@ -551,18 +570,22 @@ sub dump_stats {
 	$total_direct_latency /= 1000;
 	$total_kswapd_latency /= 1000;
 	print "\nSummary\n";
-	print "Direct reclaims:     		$total_direct_reclaim\n";
-	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
-	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
-	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
-	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
-	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "Direct reclaims:     			$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:		$total_direct_nr_scanned\n";
+	print "Direct reclaim write file sync I/O:	$total_direct_writepage_file_sync\n";
+	print "Direct reclaim write anon sync I/O:	$total_direct_writepage_anon_sync\n";
+	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
+	print "Direct reclaim write anon async I/O:	$total_direct_writepage_anon_async\n";
+	print "Wake kswapd requests:			$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 		%-1.2f ms\n", $total_direct_latency;
 	print "\n";
-	print "Kswapd wakeups:			$total_kswapd_wake\n";
-	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
-	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
-	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
-	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+	print "Kswapd wakeups:				$total_kswapd_wake\n";
+	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write file sync I/O:	$total_kswapd_writepage_file_sync\n";
+	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
+	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
+	print "Kswapd reclaim write anon async I/O:	$total_kswapd_writepage_anon_async\n";
+	printf "Time kswapd awake:			%-1.2f ms\n", $total_kswapd_latency;
 }
 
 sub aggregate_perprocesspid() {
@@ -582,8 +605,10 @@ sub aggregate_perprocesspid() {
 		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
 		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		for (my $order = 0; $order < 20; $order++) {
 			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.

This patch prevents direct reclaim from writing back filesystem pages by
checking whether current is kswapd or the page is anonymous before writing
back.  If the dirty pages cannot be written back, they are placed back on
the LRU lists for either background writing by the BDI threads or kswapd.
If dirty pages are encountered during direct lumpy reclaim, the process
stalls and waits for the background flusher before trying to reclaim the
pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.
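
As a rough sketch of the policy (illustrative only; the real change is the
reclaim_can_writeback() check added in the diff below), the decision in
shrink_page_list() amounts to:

	/* Only kswapd may write back file-backed pages from reclaim */
	static inline bool reclaim_can_writeback(struct scan_control *sc,
						struct page *page)
	{
		return !page_is_file_cache(page) || current_is_kswapd();
	}

	/* ... and dirty file pages are deferred rather than written */
	if (PageDirty(page) && !reclaim_can_writeback(sc, page)) {
		list_add(&page->lru, &dirty_pages);
		unlock_page(page);
		goto keep_dirty;
	}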

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |  116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6587155..bc50937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,61 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a range of pages fails.
+	 */
+	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		wait_on_page_writeback(page);
+
+	if (!PageWriteback(page)) {
+		/* synchronous write or broken a_ops? */
+		ClearPageReclaim(page);
+	}
+	trace_mm_vmscan_writepage(page,
+		page_is_file_cache(page),
+		sync_writeback == PAGEOUT_IO_SYNC);
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+	return PAGE_SUCCESS;
+}
+
+/*
+ * For now, only kswapd can writeback filesystem pages as otherwise
+ * there is a stack overflow risk
+ */
+static inline bool reclaim_can_writeback(struct scan_control *sc,
+					struct page *page)
+{
+	return !page_is_file_cache(page) || current_is_kswapd();
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		return PAGE_SUCCESS;
 	}
 
-	return PAGE_CLEAN;
+	return write_reclaim_page(page, mapping, sync_writeback);
 }
 
 /*
@@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 	pagevec_free(&freed_pvec);
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
-	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
-	int pgactivate = 0;
+	LIST_HEAD(putback_pages);
+	LIST_HEAD(dirty_pages);
+	int pgactivate;
+	int dirty_isolated = 0;
+	unsigned long nr_dirty;
 	unsigned long nr_reclaimed = 0;
 
+	pgactivate = 0;
 	cond_resched();
 
+restart_dirty:
+	nr_dirty = 0;
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (PageDirty(page)) {
+		if (PageDirty(page))  {
+			/*
+			 * If the caller cannot writeback pages, dirty pages
+			 * are put on a separate list for cleaning by either
+			 * a flusher thread or kswapd
+			 */
+			if (!reclaim_can_writeback(sc, page)) {
+				list_add(&page->lru, &dirty_pages);
+				unlock_page(page);
+				nr_dirty++;
+				goto keep_dirty;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -852,13 +928,39 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		list_add(&page->lru, &ret_pages);
+		list_add(&page->lru, &putback_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+		/*
+		 * Wakeup a flusher thread to clean at least as many dirty
+		 * pages as encountered by direct reclaim. Wait on congestion
+		 * to throttle processes cleaning dirty pages
+		 */
+		wakeup_flusher_threads(nr_dirty);
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		/*
+		 * As lumpy reclaim and memcg targets specific pages, wait on
+		 * them to be cleaned and try reclaim again.
+		 */
+		if (sync_writeback == PAGEOUT_IO_SYNC ||
+						sc->mem_cgroup != NULL) {
+			dirty_isolated++;
+			list_splice(&dirty_pages, page_list);
+			INIT_LIST_HEAD(&dirty_pages);
+			goto restart_dirty;
+		}
+	}
+
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	if (!list_empty(&dirty_pages))
+		list_splice(&dirty_pages, page_list);
+	list_splice(&putback_pages, page_list);
+
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow the writing back of dirty pages
in btrfs from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/btrfs/disk-io.c |   21 +--------------------
 fs/btrfs/inode.c   |    6 ------
 2 files changed, 1 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..e4aa547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -696,26 +696,7 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
 	int was_dirty;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	if (!(current->flags & PF_MEMALLOC)) {
-		return extent_write_full_page(tree, page,
-					      btree_get_extent, wbc);
-	}
-
-	redirty_page_for_writepage(wbc, page);
-	eb = btrfs_find_tree_block(root, page_offset(page),
-				      PAGE_CACHE_SIZE);
-	WARN_ON(!eb);
-
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
-	if (!was_dirty) {
-		spin_lock(&root->fs_info->delalloc_lock);
-		root->fs_info->dirty_metadata_bytes += PAGE_CACHE_SIZE;
-		spin_unlock(&root->fs_info->delalloc_lock);
-	}
-	free_extent_buffer(eb);
-
-	unlock_page(page);
-	return 0;
+	return extent_write_full_page(tree, page, btree_get_extent, wbc);
 }
 
 static int btree_writepages(struct address_space *mapping,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1bff92a..5c0e604 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5859,12 +5859,6 @@ static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree;
 
-
-	if (current->flags & PF_MEMALLOC) {
-		redirty_page_for_writepage(wbc, page);
-		unlock_page(page);
-		return 0;
-	}
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow the writing back of dirty pages
in xfs from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/xfs/linux-2.6/xfs_aops.c |   15 ---------------
 1 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 34640d6..4c89db3 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1333,21 +1333,6 @@ xfs_vm_writepage(
 	trace_xfs_writepage(inode, page, 0);
 
 	/*
-	 * Refuse to write the page out if we are called from reclaim context.
-	 *
-	 * This is primarily to avoid stack overflows when called from deep
-	 * used stacks in random callers for direct reclaim, but disabling
-	 * reclaim for kswap is a nice side-effect as kswapd causes rather
-	 * suboptimal I/O patters, too.
-	 *
-	 * This should really be done by the core VM, but until that happens
-	 * filesystems like XFS, btrfs and ext4 have to take care of this
-	 * by themselves.
-	 */
-	if (current->flags & PF_MEMALLOC)
-		goto out_fail;
-
-	/*
 	 * We need a transaction if:
 	 *  1. There are delalloc buffers on the page
 	 *  2. The page is uptodate and we have unmapped buffers
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

From: Wu Fengguang <fengguang.wu@intel.com>

A background flush work item may run forever, so it is reasonable for it
to mimic the kupdate behavior of syncing old/expired inodes first.

This behavior also makes sense from the perspective of page reclaim.
File pages are added to the inactive list and promoted if referenced
after one recycling. If not referenced, it's very easy for pages to be
cleaned from reclaim context which is inefficient in terms of IO. If
background flush is cleaning pages, it's best it cleans old pages to
help minimise IO from reclaim.
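
A minimal sketch of the cut-off this introduces (illustrative only; the
real change to wb_writeback() is in the diff below, and inode->dirtied_when
is the timestamp the writeback code already compares against):

	/* Inside the loop that queues inodes: skip recently dirtied ones */
	unsigned long oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);

	if (time_after(inode->dirtied_when, oldest_jif))
		continue;	/* too young, leave it for a later pass */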

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/fs-writeback.c |   19 ++++++++++++++++---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d5be169..cc81c67 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -612,13 +612,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 		.range_cyclic		= work->range_cyclic,
 	};
 	unsigned long oldest_jif;
+	int expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+	int fg_rounds = 0;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
+	if (wbc.for_kupdate || wbc.for_background) {
 		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
+		oldest_jif = jiffies - expire_interval;
 	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
@@ -649,6 +650,18 @@ static long wb_writeback(struct bdi_writeback *wb,
 		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 
+		if (work->for_background && expire_interval &&
+		    ++fg_rounds && list_empty(&wb->b_io)) {
+			if (fg_rounds < 10)
+				expire_interval >>= 1;
+			if (expire_interval)
+				oldest_jif = jiffies - expire_interval;
+			else
+				wbc.older_than_this = 0;
+			fg_rounds = 0;
+			continue;
+		}
+
 		/*
 		 * If we consumed everything, see if we have more
 		 */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 13:11 ` Mel Gorman
@ 2010-07-19 13:11   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli, Mel Gorman

There are a number of cases where pages get cleaned, but two of concern
to this patch are:
  o When dirtying pages, processes may be throttled to clean pages if
    dirty_ratio is not met.
  o Pages belonging to inodes dirtied longer than
    dirty_writeback_centisecs get cleaned.

The problem for reclaim is that dirty pages can reach the end of the LRU
if pages are being dirtied slowly enough that neither the throttling nor
a periodically waking flusher thread cleans them.

Background flush is already cleaning old or expired inodes first but the
expire time is too far in the future at the time of page reclaim. To mitigate
future problems, this patch wakes flusher threads to clean 1.5 times the
number of dirty pages encountered by reclaimers. The reasoning is that pages
were being dirtied at a roughly constant rate recently so if N dirty pages
were encountered in this scan block, we are likely to see roughly N dirty
pages again soon, so try to keep the flusher threads ahead of reclaim.

This is unfortunately very hand-wavy but there is not really a good way of
quantifying how bad it is when reclaim encounters dirty pages other than
"down with that sort of thing". Similarly, there is not an obvious way of
figuring out what percentage of dirty pages are old in terms of LRU-age and
should be cleaned. Ideally, the background flushers would only be cleaning
pages belonging to the zone being scanned but it's not clear if this would
be of benefit (less IO) or not (potentially less efficient IO if an inode
is scattered across multiple zones).
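
As a concrete illustration of the sizing (the actual call is in the diff
below; nr_to_clean is only a local name used here for clarity): if reclaim
encountered 64 dirty pages in a batch, the flusher threads are asked to
clean 96.

	/* Ask the flushers for 1.5x the dirty pages this scan encountered */
	unsigned long nr_to_clean = nr_dirty + nr_dirty / 2;	/* 64 -> 96 */

	wakeup_flusher_threads(laptop_mode ? 0 : nr_to_clean);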

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc50937..5763719 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -806,6 +806,8 @@ restart_dirty:
 		}
 
 		if (PageDirty(page))  {
+			nr_dirty++;
+
 			/*
 			 * If the caller cannot writeback pages, dirty pages
 			 * are put on a separate list for cleaning by either
@@ -814,7 +816,6 @@ restart_dirty:
 			if (!reclaim_can_writeback(sc, page)) {
 				list_add(&page->lru, &dirty_pages);
 				unlock_page(page);
-				nr_dirty++;
 				goto keep_dirty;
 			}
 
@@ -933,13 +934,16 @@ keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	/*
+	 * If reclaim is encountering dirty pages, it may be because
+	 * dirty pages are reaching the end of the LRU even though
+	 * the dirty_ratio may be satisfied. In this case, wake
+	 * flusher threads to pro-actively clean some pages
+	 */
+	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
+
 	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
-		/*
-		 * Wakeup a flusher thread to clean at least as many dirty
-		 * pages as encountered by direct reclaim. Wait on congestion
-		 * to throttle processes cleaning dirty pages
-		 */
-		wakeup_flusher_threads(nr_dirty);
+		/* Throttle direct reclaimers cleaning pages */
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 13:24     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 13:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> It is useful to distinguish between IO for anon and file pages. This
> patch updates
> vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include
> that information. The patches can be merged together.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 13:32     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> It is useful to distinguish between IO for anon and file pages. This patch
> updates
> vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch
> so the post-processing script can handle the additional information.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 14:15     ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:24PM +0100, Mel Gorman wrote:
> It is useful to distinguish between IO for anon and file pages. This
> patch updates
> vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include
> that information. The patches can be merged together.

I think the trace would be nicer if you #define flags for both
cases and then use __print_flags on them.  That'll also make it more
extensible in case we need to add more flags later.
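
Something along these lines, for instance (a sketch only; the flag names,
values and the reclaim_flags field are illustrative rather than taken from
the patch):

	#define RECLAIM_WB_ANON		0x0001u
	#define RECLAIM_WB_FILE		0x0002u
	#define RECLAIM_WB_SYNC		0x0004u
	#define RECLAIM_WB_ASYNC	0x0008u

	#define show_reclaim_flags(flags)				\
		__print_flags(flags, "|",				\
			{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
			{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
			{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
			{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"})

	/* then in TP_printk() of mm_vmscan_writepage */
	TP_printk("page=%p pfn=%lu flags=%s",
		__entry->page, __entry->pfn,
		show_reclaim_flags(__entry->reclaim_flags))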

And a purely procedural question:  This is supposed to get rolled into
the original patch before it gets committed to a git tree, right?


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 14:19     ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.

While it is not quite as deep, since it skips the filesystem allocator and
extent mapping code, it can still be quite deep for swap given that it
still has to traverse the whole I/O stack.  Probably not worth worrying
about now, but we need to keep an eye on it.

The patch looks fine to me anyway.


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 14:20     ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:28PM +0100, Mel Gorman wrote:
> As only kswapd and memcg are writing back pages, there should be no
> danger of overflowing the stack. Allow the writing back of dirty pages
> in xfs from the VM.

As pointed out during the discussion on one of your previous posts, memcg
does pose a huge risk of stack overflows.  In the XFS tree we've already
relaxed the check to allow writeback from kswapd, and until the memcg
situation is resolved we'll need to keep that check.
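
The check being referred to has roughly the following shape at the top of a
->writepage implementation; this is a sketch of the idea only, and
example_writepage() is a placeholder rather than the actual XFS code:

/* Refuse writeback from direct reclaim (PF_MEMALLOC set, PF_KSWAPD
 * clear) while still allowing kswapd through. */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* ... filesystem-specific writeback continues here ... */
	return 0;
}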


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 14:21     ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote:
> From: Wu Fengguang <fengguang.wu@intel.com>
> 
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> This behavior also makes sense from the perspective of page reclaim.
> File pages are added to the inactive list and promoted if referenced
> after one recycling. If not referenced, it's very easy for pages to be
> cleaned from reclaim context which is inefficient in terms of IO. If
> background flush is cleaning pages, it's best it cleans old pages to
> help minimise IO from reclaim.

Yes, we absolutely do this.  Wu, do you have an improved version of the
pending or should we put it in this version for now?


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 14:23     ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though
> +	 * the dirty_ratio may be satisfied. In this case, wake
> +	 * flusher threads to pro-actively clean some pages
> +	 */
> +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> +

Where is the laptop-mode magic coming from?

And btw, at least currently wakeup_flusher_threads writes back nr_pages
for each BDI, which might not be what you want.  Then again probably
no caller wants it, but I don't see an easy way to fix it.
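
For reference, the behaviour described here amounts to roughly the following;
this is a simplified sketch rather than the actual fs/fs-writeback.c source,
and bdi_queue_writeback_work() is a made-up stand-in for the real per-BDI
work submission:

/* Each BDI with dirty IO is asked to write nr_pages, so the total
 * requested can be nr_pages times the number of active BDIs. */
void wakeup_flusher_threads_sketch(long nr_pages)
{
	struct backing_dev_info *bdi;

	if (nr_pages == 0)	/* 0 means "all dirty pages in the system" */
		nr_pages = global_page_state(NR_FILE_DIRTY) +
			   global_page_state(NR_UNSTABLE_NFS);

	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (!bdi_has_dirty_io(bdi))
			continue;
		bdi_queue_writeback_work(bdi, nr_pages);	/* stand-in */
	}
	rcu_read_unlock();
}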


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-19 14:15     ` Christoph Hellwig
@ 2010-07-19 14:24       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 14:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:15:01AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:24PM +0100, Mel Gorman wrote:
> > It is useful to distinguish between IO for anon and file pages. This
> > patch updates
> > vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include
> > that information. The patches can be merged together.
> 
> I think the trace would be nicer if you #define flags for both
> cases and then use __print_flags on them.  That'll also make it more
> extensible in case we need to add more flags later.
> 

Not a bad idea, I'll check it out. Thanks. The first flags would be:

RECLAIM_WB_ANON
RECLAIM_WB_FILE

Does anyone have problems with the naming?
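
As an illustration of the __print_flags suggestion with these names, a minimal
sketch of what the trace header could look like; the mm_vmscan_writepage event
name and field layout are assumptions based on this series, not the final
merged code:

#define RECLAIM_WB_ANON		0x0001u
#define RECLAIM_WB_FILE		0x0002u

#define show_reclaim_flags(flags)				\
	__print_flags(flags, "|",				\
		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"}	\
	)

TRACE_EVENT(mm_vmscan_writepage,

	TP_PROTO(struct page *page, int reclaim_flags),

	TP_ARGS(page, reclaim_flags),

	TP_STRUCT__entry(
		__field(unsigned long,	pfn)
		__field(int,		reclaim_flags)
	),

	TP_fast_assign(
		__entry->pfn		= page_to_pfn(page);
		__entry->reclaim_flags	= reclaim_flags;
	),

	TP_printk("page=%p flags=%s",
		pfn_to_page(__entry->pfn),
		show_reclaim_flags(__entry->reclaim_flags))
);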


> And a purely procedural question:  This is supposed to get rolled into
> the original patch before it gets committed to a git tree, right?
> 

That is my expectation.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-19 14:24       ` Mel Gorman
@ 2010-07-19 14:26         ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Mon, Jul 19, 2010 at 03:24:36PM +0100, Mel Gorman wrote:
> Not a bad idea, I'll check it out. Thanks. The first flags would be:
> 
> RECLAIM_WB_ANON
> RECLAIM_WB_FILE
> 
> Does anyone have problems with the naming?

The names look fine to me.


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 14:19     ` Christoph Hellwig
@ 2010-07-19 14:26       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 14:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:19:34AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> 
> While it is not quite as deep, since it skips the filesystem allocator and
> extent mapping code, it can still be quite deep for swap given that it
> still has to traverse the whole I/O stack.  Probably not worth worrying
> about now, but we need to keep an eye on it.
> 

Agreed that we need to keep an eye on it. If this ever becomes a
problem, we're going to need to consider a flusher for anonymous pages.
If you look at the figures, we are still doing a lot of writeback of
anonymous pages. Granted, the layout of swap sucks anyway, but it's
something to keep in the back of our minds.

> The patch looks fine to me anyway.
> 

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 14:23     ` Christoph Hellwig
@ 2010-07-19 14:37       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 14:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > +	/*
> > +	 * If reclaim is encountering dirty pages, it may be because
> > +	 * dirty pages are reaching the end of the LRU even though
> > +	 * the dirty_ratio may be satisfied. In this case, wake
> > +	 * flusher threads to pro-actively clean some pages
> > +	 */
> > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > +
> 
> Where is the laptop-mode magic coming from?
> 

It comes from other parts of page reclaim where writing pages is avoided
where possible. Things like this:

	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);

and

	.may_writepage = !laptop_mode

although the latter can get disabled too. Deleting the magic is an
option which would trade IO efficiency for power efficiency but my
current thinking is that laptop mode prefers reduced power.

> And btw, at least currently wakeup_flusher_threads writes back nr_pages
> for each BDI, which might not be what you want. 

I saw you pointing that out in another thread, although I can't
remember the context. It's not exactly what I want, but then again what we
really want is writeback of pages from a particular zone, which we don't
get either. There did not seem to be an ideal option here and this appeared
to be "less bad" than the alternatives.

> Then again probably
> no caller wants it, but I don't see an easy way to fix it.
> 

I didn't either, but my writeback-foo is weak (getting better but still weak).
I hope to bring it up at the MM Summit and maybe at the Filesystem Summit too,
to see what ideas exist to improve this.

When this idea was first floated, you called it a band-aid and I
prioritised writing back old inodes over this. How do you feel about
this approach now?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 14:21     ` Christoph Hellwig
@ 2010-07-19 14:40       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 14:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:21:45AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote:
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > This behavior also makes sense from the perspective of page reclaim.
> > File pages are added to the inactive list and promoted if referenced
> > after one recycling. If not referenced, it's very easy for pages to be
> > cleaned from reclaim context which is inefficient in terms of IO. If
> > background flush is cleaning pages, it's best it cleans old pages to
> > help minimise IO from reclaim.
> 
> Yes, we absolutely do this. 

Do you mean we absolutely want to do this?

> Wu, do you have an improved version of the
> pending or should we put it in this version for now?
> 

Some insight into how the other writeback changes being floated around might
affect the number of dirty pages reclaim encounters would also be helpful.
The tracepoints are there for people to figure it out, but any help
interpreting them is useful.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages
  2010-07-19 14:20     ` Christoph Hellwig
@ 2010-07-19 14:43       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-19 14:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:20:51AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:28PM +0100, Mel Gorman wrote:
> > As only kswapd and memcg are writing back pages, there should be no
> > danger of overflowing the stack. Allow the writing back of dirty pages
> > in xfs from the VM.
> 
> As pointed out during the discussion on one of your previous posts, memcg
> does pose a huge risk of stack overflows. 

I remember. This is partially to nudge the memcg people to see where
they currently stand with alleviating the problem.

> In the XFS tree we've already
> relaxed the check to allow writeback from kswapd, and until the memcg
> situation is resolved we'll need to keep that check.
> 

If memcg remains a problem, I'll drop these two patches.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 14:40       ` Mel Gorman
@ 2010-07-19 14:48         ` Christoph Hellwig
  -1 siblings, 0 replies; 177+ messages in thread
From: Christoph Hellwig @ 2010-07-19 14:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Mon, Jul 19, 2010 at 03:40:47PM +0100, Mel Gorman wrote:
> > Yes, we absolutely do this. 
> 
> Do you mean we absolutely want to do this?

Ermm yes, sorry.


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 18:25     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 18:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 18:27     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 18:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> As only kswapd and memcg are writing back pages, there should be no
> danger of overflowing the stack. Allow the writing back of dirty pages
> in btrfs from the VM.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 18:43     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 18:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> From: Wu Fengguang<fengguang.wu@intel.com>
>
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
>
> This behavior also makes sense from the perspective of page reclaim.
> File pages are added to the inactive list and promoted if referenced
> after one recycling. If not referenced, it's very easy for pages to be
> cleaned from reclaim context which is inefficient in terms of IO. If
> background flush is cleaning pages, it's best it cleans old pages to
> help minimise IO from reclaim.
>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

It can probably be optimized, but we really need something
like this...
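
The behaviour being acked here boils down to roughly the following in the
background-writeback path. This is a sketch of the approach, assuming the
writeback work item carries a for_background flag; it is not the exact patch:

	/*
	 * Like the kupdate case, have background writeback start with
	 * inodes that have been dirty for longer than the expire
	 * interval, so reclaim tends to see old pages cleaned first.
	 */
	if (work->for_kupdate || work->for_background) {
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		wbc.older_than_this = &oldest_jif;
	}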

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 18:59     ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-19 18:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On 07/19/2010 09:11 AM, Mel Gorman wrote:
> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
>    o When dirtying pages, processes may be throttled to clean pages if
>      dirty_ratio is not met.
>    o Pages belonging to inodes dirtied longer than
>      dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU
> if pages are being dirtied slowly so that neither the throttling cleans
> them or a flusher thread waking periodically.

I can't see a better way to do this without creating
a way-too-big-to-merge patch series, and this patch
should result in the right behaviour, so ...

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 22:14     ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-19 22:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

Hi Mel,

On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>  		return PAGE_SUCCESS;
>  	}

Did you forget to delete the worker code from pageout() which is now
in write_reclaim_page()?

> -	return PAGE_CLEAN;
> +	return write_reclaim_page(page, mapping, sync_writeback);
>  }
>  
>  /*
> @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  	pagevec_free(&freed_pvec);
>  }
>  
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
>  					enum pageout_io sync_writeback)
>  {
> -	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> -	int pgactivate = 0;
> +	LIST_HEAD(putback_pages);
> +	LIST_HEAD(dirty_pages);
> +	int pgactivate;
> +	int dirty_isolated = 0;
> +	unsigned long nr_dirty;
>  	unsigned long nr_reclaimed = 0;
>  
> +	pgactivate = 0;
>  	cond_resched();
>  
> +restart_dirty:
> +	nr_dirty = 0;
>  	while (!list_empty(page_list)) {
>  		enum page_references references;
>  		struct address_space *mapping;
> @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (PageDirty(page)) {
> +		if (PageDirty(page))  {
> +			/*
> +			 * If the caller cannot writeback pages, dirty pages
> +			 * are put on a separate list for cleaning by either
> +			 * a flusher thread or kswapd
> +			 */
> +			if (!reclaim_can_writeback(sc, page)) {
> +				list_add(&page->lru, &dirty_pages);
> +				unlock_page(page);
> +				nr_dirty++;
> +				goto keep_dirty;
> +			}
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -852,13 +928,39 @@ activate_locked:
>  keep_locked:
>  		unlock_page(page);
>  keep:
> -		list_add(&page->lru, &ret_pages);
> +		list_add(&page->lru, &putback_pages);
> +keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> +		/*
> +		 * Wakeup a flusher thread to clean at least as many dirty
> +		 * pages as encountered by direct reclaim. Wait on congestion
> +		 * to throttle processes cleaning dirty pages
> +		 */
> +		wakeup_flusher_threads(nr_dirty);
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		/*
> +		 * As lumpy reclaim and memcg targets specific pages, wait on
> +		 * them to be cleaned and try reclaim again.
> +		 */
> +		if (sync_writeback == PAGEOUT_IO_SYNC ||
> +						sc->mem_cgroup != NULL) {
> +			dirty_isolated++;
> +			list_splice(&dirty_pages, page_list);
> +			INIT_LIST_HEAD(&dirty_pages);
> +			goto restart_dirty;
> +		}
> +	}

I think it would turn out more natural to just return dirty pages on
page_list and have the whole looping logic in shrink_inactive_list().

Mixing dirty pages with other 'please try again' pages is probably not
so bad anyway; it means we could retry all temporarily unavailable pages
instead of twiddling our thumbs over that particular bunch of pages until
the flushers catch up.

What do you think?
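
A rough sketch of the shape being suggested, with shrink_page_list() only
counting the dirty pages it skipped and the caller doing the wakeup, wait and
retry. The extra nr_dirty parameter and the helper name are illustrative,
not taken from the patch:

/* Sketch: the retry loop lives in the caller, not in shrink_page_list() */
static unsigned long shrink_list_with_retry(struct list_head *page_list,
					    struct scan_control *sc,
					    enum pageout_io sync_writeback)
{
	unsigned long nr_reclaimed = 0;
	unsigned long nr_dirty;
	int retries = 0;

	do {
		/* dirty pages the callee cannot write stay on page_list
		 * and are counted in nr_dirty (illustrative interface) */
		nr_reclaimed += shrink_page_list(page_list, sc,
						 sync_writeback, &nr_dirty);
		if (!nr_dirty)
			break;

		/* ask the flushers to clean them, throttle briefly, then
		 * retry everything still on page_list in one go */
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	} while (++retries < MAX_SWAP_CLEAN_WAIT);

	/* whatever remains on page_list is put back on the LRU */
	return nr_reclaimed;
}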

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-19 22:26     ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-19 22:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> @@ -933,13 +934,16 @@ keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though
> +	 * the dirty_ratio may be satisfied. In this case, wake
> +	 * flusher threads to pro-actively clean some pages
> +	 */
> +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);

An argument of 0 means 'every dirty page in the system'; I assume this
is not what you wanted, right?  Something like this?

	if (nr_dirty && !laptop_mode)
		wakeup_flusher_threads(nr_dirty + nr_dirty / 2);

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 14:37       ` Mel Gorman
@ 2010-07-19 22:48         ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-19 22:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > > +	/*
> > > +	 * If reclaim is encountering dirty pages, it may be because
> > > +	 * dirty pages are reaching the end of the LRU even though
> > > +	 * the dirty_ratio may be satisfied. In this case, wake
> > > +	 * flusher threads to pro-actively clean some pages
> > > +	 */
> > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > +
> > 
> > Where is the laptop-mode magic coming from?
> > 
> 
> It comes from other parts of page reclaim where writing pages is avoided
> where possible. Things like this:
> 
> 	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);

Actually, it's not avoiding writing pages in laptop mode; instead it
is lumping writeouts together aggressively (as I wrote in my other mail,
.nr_pages=0 means 'write everything') to keep disk spin-ups rare and
make maximum use of them.

> although the latter can get disabled too. Deleting the magic is an
> option which would trade IO efficiency for power efficiency but my
> current thinking is that laptop mode prefers reduced power.

Maybe couple your wakeup with sc->may_writepage?  It is usually false
for laptop_mode but direct reclaimers enable it at one point in
do_try_to_free_pages() when it scanned more than 150% of the reclaim
target, so you could use existing disk spin-up points instead of
introducing new ones or disabling the heuristics in laptop mode.
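
In code, that suggestion would look something like the following against the
hunk quoted above (a sketch, not a tested replacement):

	/*
	 * Only wake the flushers when reclaim is allowed to write pages.
	 * In laptop mode sc->may_writepage normally stays false until
	 * do_try_to_free_pages() has scanned well past the reclaim
	 * target, so this reuses the existing disk spin-up points.
	 */
	if (nr_dirty && sc->may_writepage)
		wakeup_flusher_threads(nr_dirty + nr_dirty / 2);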

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-19 22:14     ` Johannes Weiner
@ 2010-07-20 13:45       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-20 13:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> Hi Mel,
> 
> On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> > @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> >  		return PAGE_SUCCESS;
> >  	}
> 
> Did you forget to delete the worker code from pageout() which is now
> in write_reclaim_page()?
> 

Damn, a snarl during the final rebase when collapsing patches together that
I missed when re-reading. Sorry :(

> > -	return PAGE_CLEAN;
> > +	return write_reclaim_page(page, mapping, sync_writeback);
> >  }
> >  
> >  /*
> > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >  	pagevec_free(&freed_pvec);
> >  }
> >  
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> >  /*
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> >  					enum pageout_io sync_writeback)
> >  {
> > -	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> > -	int pgactivate = 0;
> > +	LIST_HEAD(putback_pages);
> > +	LIST_HEAD(dirty_pages);
> > +	int pgactivate;
> > +	int dirty_isolated = 0;
> > +	unsigned long nr_dirty;
> >  	unsigned long nr_reclaimed = 0;
> >  
> > +	pgactivate = 0;
> >  	cond_resched();
> >  
> > +restart_dirty:
> > +	nr_dirty = 0;
> >  	while (!list_empty(page_list)) {
> >  		enum page_references references;
> >  		struct address_space *mapping;
> > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			}
> >  		}
> >  
> > -		if (PageDirty(page)) {
> > +		if (PageDirty(page))  {
> > +			/*
> > +			 * If the caller cannot writeback pages, dirty pages
> > +			 * are put on a separate list for cleaning by either
> > +			 * a flusher thread or kswapd
> > +			 */
> > +			if (!reclaim_can_writeback(sc, page)) {
> > +				list_add(&page->lru, &dirty_pages);
> > +				unlock_page(page);
> > +				nr_dirty++;
> > +				goto keep_dirty;
> > +			}
> > +
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -852,13 +928,39 @@ activate_locked:
> >  keep_locked:
> >  		unlock_page(page);
> >  keep:
> > -		list_add(&page->lru, &ret_pages);
> > +		list_add(&page->lru, &putback_pages);
> > +keep_dirty:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> > +	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> > +		/*
> > +		 * Wakeup a flusher thread to clean at least as many dirty
> > +		 * pages as encountered by direct reclaim. Wait on congestion
> > +		 * to throttle processes cleaning dirty pages
> > +		 */
> > +		wakeup_flusher_threads(nr_dirty);
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		/*
> > +		 * As lumpy reclaim and memcg targets specific pages, wait on
> > +		 * them to be cleaned and try reclaim again.
> > +		 */
> > +		if (sync_writeback == PAGEOUT_IO_SYNC ||
> > +						sc->mem_cgroup != NULL) {
> > +			dirty_isolated++;
> > +			list_splice(&dirty_pages, page_list);
> > +			INIT_LIST_HEAD(&dirty_pages);
> > +			goto restart_dirty;
> > +		}
> > +	}
> 
> I think it would turn out more natural to just return dirty pages on
> page_list and have the whole looping logic in shrink_inactive_list().
> 
> Mixing dirty pages with other 'please try again' pages is probably not
> so bad anyway, it means we could retry all temporary unavailable pages
> instead of twiddling thumbs over that particular bunch of pages until
> the flushers catch up.
> 
> What do you think?
> 

It's worth considering! It won't be very tidy but it's workable. The reason
it is not tidy is that dirty pages and pages that couldn't be paged will be
on the same list, so the whole lot will need to be recycled. We'd record in
scan_control though that there were pages that need to be retried and loop
based on that value. That is manageable though.
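
Roughly something like this untested sketch, with sc->nr_retry being an
invented name for that counter (the lumpy/memcg gating is left out here):

	sc->nr_retry = 0;
	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);

	/*
	 * shrink_page_list() counted every page it put back that was only
	 * temporarily unavailable (dirty, locked, under writeback). Kick
	 * the flushers and retry the same list instead of isolating more.
	 */
	retries = MAX_SWAP_CLEAN_WAIT;
	while (sc->nr_retry && retries--) {
		wakeup_flusher_threads(sc->nr_retry);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		sc->nr_retry = 0;
		nr_reclaimed += shrink_page_list(&page_list, sc,
						 PAGEOUT_IO_ASYNC);
	}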

The reason why I did it this way was because of lumpy reclaim and memcg
requiring specific pages. I considered lumpy reclaim to be the more common
case. In that case, it's removing potentially a large number of pages from
the LRU that are contiguous. If some of those are dirty and it selects more
contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the
system even worse than it currently does when the system is under load. Hence,
this wait and retry loop is done instead of returning and isolating more pages.

For memcg, the concern was different. It depends on flusher threads
to clean its pages; kswapd does not operate on the list and memcg reclaim
can't clean pages itself because the stack may overflow. If the memcg has
many dirty pages, one process in the container could isolate all the dirty
pages in the list, forcing others to reclaim clean pages regardless of age.
This could be very disruptive, so looping like this throttles processes that
encounter dirty pages instead of letting them isolate more.

For lumpy, I don't think we should return and isolate more pages, it's
too disruptive. For memcg, I think it could possibly get an advantage
but there is a nasty corner case if the container is mostly dirty - it
depends on how memcg handles dirty_ratio I guess.

Is it worth it at this point?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 22:48         ` Johannes Weiner
@ 2010-07-20 14:10           ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-20 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote:
> On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> > On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> > > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > > > +	/*
> > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > +	 * flusher threads to pro-actively clean some pages
> > > > +	 */
> > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > +
> > > 
> > > Where is the laptop-mode magic coming from?
> > > 
> > 
> > It comes from other parts of page reclaim where writing pages is avoided
> > by page reclaim where possible. Things like this
> > 
> > 	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> 
> Actually, it's not avoiding writing pages in laptop mode, instead it
> is lumping writeouts aggressively (as I wrote in my other mail,
> .nr_pages=0 means 'write everything') to keep disk spinups rare and
> make maximum use of them.
> 

You're right, 0 does mean flush everything - /me slaps self. It was introduced
in 2.6.6 with the patch "[PATCH] laptop mode". Quoting from it:

    Algorithm: the idea is to hold dirty data in memory for a long time,
    but to flush everything which has been accumulated if the disk happens
    to spin up for other reasons.

So, the reason for the magic is half right - avoid excessive disk spin-ups -
but my reasoning for it was wrong. I thought it was avoiding cleaning to
save power.  What it is actually intended to do is "if we are spinning up the
disk anyway, do as much work as possible so it can spin down for longer later".
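
In other words, the existing call with the intent spelled out:

	/*
	 * laptop_mode: nr_pages == 0 asks the flusher threads to write back
	 * everything they have accumulated - the disk is being spun up
	 * anyway, so make the spin-up count and let the disk stay spun
	 * down for longer afterwards.
	 */
	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);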

Where it's wrong is that it should only wake up flusher threads if dirty
pages were encountered. What it's doing right now is potentially
cleaning everything. It means I need to rerun all the tests and see
whether the number of pages encountered by page reclaim is really reduced,
or whether it only looked that way because I was calling
wakeup_flusher_threads(0) when no dirty pages were encountered.

> > although the latter can get disabled too. Deleting the magic is an
> > option which would trade IO efficiency for power efficiency but my
> > current thinking is laptop mode preferred reduced power.
> 
> Maybe couple your wakeup with sc->may_writepage?  It is usually false
> for laptop_mode but direct reclaimers enable it at one point in
> do_try_to_free_pages() when it scanned more than 150% of the reclaim
> target, so you could use existing disk spin-up points instead of
> introducing new ones or disabling the heuristics in laptop mode.
> 

How about the following?

        if (nr_dirty && sc->may_writepage)
                wakeup_flusher_threads(laptop_mode ? 0 :
                                                nr_dirty + nr_dirty / 2);


1. Wake up flusher threads if dirty pages are encountered
2. For direct reclaim, only wake them up if may_writepage is set
   indicating that the system is ready to spin up disks and start
   reclaiming
3. In laptop_mode, flush everything to reduce future spin-ups

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-20 13:45       ` Mel Gorman
@ 2010-07-20 22:02         ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-20 22:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote:
> On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> > > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > >  	pagevec_free(&freed_pvec);
> > >  }
> > >  
> > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > > +#define MAX_SWAP_CLEAN_WAIT 50
> > > +
> > >  /*
> > >   * shrink_page_list() returns the number of reclaimed pages
> > >   */
> > > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  					struct scan_control *sc,
> > >  					enum pageout_io sync_writeback)
> > >  {
> > > -	LIST_HEAD(ret_pages);
> > >  	LIST_HEAD(free_pages);
> > > -	int pgactivate = 0;
> > > +	LIST_HEAD(putback_pages);
> > > +	LIST_HEAD(dirty_pages);
> > > +	int pgactivate;
> > > +	int dirty_isolated = 0;
> > > +	unsigned long nr_dirty;
> > >  	unsigned long nr_reclaimed = 0;
> > >  
> > > +	pgactivate = 0;
> > >  	cond_resched();
> > >  
> > > +restart_dirty:
> > > +	nr_dirty = 0;
> > >  	while (!list_empty(page_list)) {
> > >  		enum page_references references;
> > >  		struct address_space *mapping;
> > > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  			}
> > >  		}
> > >  
> > > -		if (PageDirty(page)) {
> > > +		if (PageDirty(page))  {
> > > +			/*
> > > +			 * If the caller cannot writeback pages, dirty pages
> > > +			 * are put on a separate list for cleaning by either
> > > +			 * a flusher thread or kswapd
> > > +			 */
> > > +			if (!reclaim_can_writeback(sc, page)) {
> > > +				list_add(&page->lru, &dirty_pages);
> > > +				unlock_page(page);
> > > +				nr_dirty++;
> > > +				goto keep_dirty;
> > > +			}
> > > +
> > >  			if (references == PAGEREF_RECLAIM_CLEAN)
> > >  				goto keep_locked;
> > >  			if (!may_enter_fs)
> > > @@ -852,13 +928,39 @@ activate_locked:
> > >  keep_locked:
> > >  		unlock_page(page);
> > >  keep:
> > > -		list_add(&page->lru, &ret_pages);
> > > +		list_add(&page->lru, &putback_pages);
> > > +keep_dirty:
> > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > >  	}
> > >  
> > > +	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> > > +		/*
> > > +		 * Wakeup a flusher thread to clean at least as many dirty
> > > +		 * pages as encountered by direct reclaim. Wait on congestion
> > > +		 * to throttle processes cleaning dirty pages
> > > +		 */
> > > +		wakeup_flusher_threads(nr_dirty);
> > > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +
> > > +		/*
> > > +		 * As lumpy reclaim and memcg targets specific pages, wait on
> > > +		 * them to be cleaned and try reclaim again.
> > > +		 */
> > > +		if (sync_writeback == PAGEOUT_IO_SYNC ||
> > > +						sc->mem_cgroup != NULL) {
> > > +			dirty_isolated++;
> > > +			list_splice(&dirty_pages, page_list);
> > > +			INIT_LIST_HEAD(&dirty_pages);
> > > +			goto restart_dirty;
> > > +		}
> > > +	}
> > 
> > I think it would turn out more natural to just return dirty pages on
> > page_list and have the whole looping logic in shrink_inactive_list().
> > 
> > Mixing dirty pages with other 'please try again' pages is probably not
> > so bad anyway, it means we could retry all temporary unavailable pages
> > instead of twiddling thumbs over that particular bunch of pages until
> > the flushers catch up.
> > 
> > What do you think?
> > 
> 
> It's worth considering! It won't be very tidy but it's workable. The reason
> it is not tidy is that dirty pages and pages that couldn't be paged will be
> on the same list so they whole lot will need to be recycled. We'd record in
> scan_control though that there were pages that need to be retried and loop
> based on that value. That is managable though.

Recycling all of them is what I had in mind, yeah.  But...

> The reason why I did it this way was because of lumpy reclaim and memcg
> requiring specific pages. I considered lumpy reclaim to be the more common
> case. In that case, it's removing potentially a large number of pages from
> the LRU that are contiguous. If some of those are dirty and it selects more
> contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the
> system even worse than it currently does when the system is under load. Hence,
> this wait and retry loop is done instead of returning and isolating more pages.

I think here we missed each other.  I don't want the loop to sit _that_
far out in the outer scope that isolation is repeated as well.  What
I had in mind is the attached patch.  It is not tested and was hacked up
rather quickly due to time constraints, sorry, but you should get the
idea.  I hope I did not miss anything fundamental.

Note that since only kswapd enters pageout() anymore, everything
depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)

	Hannes

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -386,21 +386,17 @@ static pageout_t pageout(struct page *pa
 			ClearPageReclaim(page);
 			return PAGE_ACTIVATE;
 		}
-
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
 		if (!PageWriteback(page)) {
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
 			page_is_file_cache(page),
+			/*
+			 * Humm.  Only kswapd comes here and for
+			 * kswapd there never is a PAGEOUT_IO_SYNC
+			 * cycle...
+			 */
 			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
@@ -643,12 +639,14 @@ static noinline_for_stack void free_page
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc,
+				      enum pageout_io sync_writeback,
+				      int *dirty_seen)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st
 		enum page_references references;
 		struct address_space *mapping;
 		struct page *page;
-		int may_enter_fs;
+		int may_pageout;
 
 		cond_resched();
 
@@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st
 		if (page_mapped(page) || PageSwapCache(page))
 			sc->nr_scanned++;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+		/*
+		 * To prevent stack overflows, only kswapd can enter
+		 * the filesystem.  Swap IO is always fine (for now).
+		 */
+		may_pageout = current_is_kswapd() ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
+			int may_wait;
 			/*
 			 * Synchronous reclaim is performed in two passes,
 			 * first an asynchronous pass over the list to
@@ -693,7 +696,8 @@ static unsigned long shrink_page_list(st
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout;
+			if (sync_writeback == PAGEOUT_IO_SYNC && may_wait)
 				wait_on_page_writeback(page);
 			else
 				goto keep_locked;
@@ -719,7 +723,7 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
-			may_enter_fs = 1;
+			may_pageout = 1;
 		}
 
 		mapping = page_mapping(page);
@@ -742,9 +746,11 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty++;
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
-			if (!may_enter_fs)
+			if (!may_pageout)
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;
@@ -860,6 +866,7 @@ keep:
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
+	*dirty_seen = nr_dirty;
 	return nr_reclaimed;
 }
 
@@ -1232,6 +1239,9 @@ static noinline_for_stack void update_is
 	reclaim_stat->recent_scanned[1] += *nr_file;
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1247,6 +1257,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1295,26 +1306,34 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
-
+	nr_reclaimed = shrink_page_list(&page_list, sc,
+					PAGEOUT_IO_ASYNC,
+					&nr_dirty);
 	/*
 	 * If we are direct reclaiming for contiguous pages and we do
 	 * not reclaim everything in the list, try again and wait
 	 * for IO to complete. This will stall high-order allocations
 	 * but that should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 *
+			 * Humm.  Still needed?
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+							 PAGEOUT_IO_SYNC,
+							 &nr_dirty);
+		}
 	}
 
 	local_irq_disable();

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-20 14:10           ` Mel Gorman
@ 2010-07-20 22:05             ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-20 22:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 20, 2010 at 03:10:49PM +0100, Mel Gorman wrote:
> On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote:
> > On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> > > although the latter can get disabled too. Deleting the magic is an
> > > option which would trade IO efficiency for power efficiency but my
> > > current thinking is laptop mode preferred reduced power.
> > 
> > Maybe couple your wakeup with sc->may_writepage?  It is usually false
> > for laptop_mode but direct reclaimers enable it at one point in
> > do_try_to_free_pages() when it scanned more than 150% of the reclaim
> > target, so you could use existing disk spin-up points instead of
> > introducing new ones or disabling the heuristics in laptop mode.
> > 
> 
> How about the following?
> 
>         if (nr_dirty && sc->may_writepage)
>                 wakeup_flusher_threads(laptop_mode ? 0 :
>                                                 nr_dirty + nr_dirty / 2);
> 
> 
> 1. Wakup flusher threads if dirty pages are encountered
> 2. For direct reclaim, only wake them up if may_writepage is set
>    indicating that the system is ready to spin up disks and start
>    reclaiming
> 3. In laptop_mode, flush everything to reduce future spin-ups

Sounds like the sanest approach to me.  Thanks.

	Hannes

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-20 22:02         ` Johannes Weiner
@ 2010-07-21 11:36           ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-21 11:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote:
> > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> > > I think it would turn out more natural to just return dirty pages on
> > > page_list and have the whole looping logic in shrink_inactive_list().
> > > 
> > > Mixing dirty pages with other 'please try again' pages is probably not
> > > so bad anyway, it means we could retry all temporary unavailable pages
> > > instead of twiddling thumbs over that particular bunch of pages until
> > > the flushers catch up.
> > > 
> > > What do you think?
> > > 
[...]
> > The reason why I did it this way was because of lumpy reclaim and memcg
> > requiring specific pages. I considered lumpy reclaim to be the more common
> > case. In that case, it's removing potentially a large number of pages from
> > the LRU that are contiguous. If some of those are dirty and it selects more
> > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the
> > system even worse than it currently does when the system is under load. Hence,
> > this wait and retry loop is done instead of returning and isolating more pages.
> 
> I think here we missed each other.  I don't want the loop to be _that_
> much more in the outer scope that isolation is repeated as well.  What
> I had in mind is the attached patch.  It is not tested and hacked up
> rather quickly due to time constraints, sorry, but you should get the
> idea.  I hope I did not miss anything fundamental.
> 
> Note that since only kswapd enters pageout() anymore, everything
> depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)

Aaaaand direct reclaimers for swap, of course.  Selfslap.  Here is the
patch again, sans the first hunk (and the type of @dirty_seen fixed):

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -643,12 +643,14 @@ static noinline_for_stack void free_page
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc,
+				      enum pageout_io sync_writeback,
+				      unsigned long *dirty_seen)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -657,7 +659,7 @@ static unsigned long shrink_page_list(st
 		enum page_references references;
 		struct address_space *mapping;
 		struct page *page;
-		int may_enter_fs;
+		int may_pageout;
 
 		cond_resched();
 
@@ -681,10 +683,15 @@ static unsigned long shrink_page_list(st
 		if (page_mapped(page) || PageSwapCache(page))
 			sc->nr_scanned++;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+		/*
+		 * To prevent stack overflows, only kswapd can enter
+		 * the filesystem.  Swap IO is always fine (for now).
+		 */
+		may_pageout = current_is_kswapd() ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
+			int may_wait;
 			/*
 			 * Synchronous reclaim is performed in two passes,
 			 * first an asynchronous pass over the list to
@@ -693,7 +700,8 @@ static unsigned long shrink_page_list(st
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout;
+			if (sync_writeback == PAGEOUT_IO_SYNC && may_wait)
 				wait_on_page_writeback(page);
 			else
 				goto keep_locked;
@@ -719,7 +727,7 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
-			may_enter_fs = 1;
+			may_pageout = 1;
 		}
 
 		mapping = page_mapping(page);
@@ -742,9 +750,11 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty++;
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
-			if (!may_enter_fs)
+			if (!may_pageout)
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;
@@ -860,6 +870,7 @@ keep:
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
+	*dirty_seen = nr_dirty;
 	return nr_reclaimed;
 }
 
@@ -1232,6 +1243,9 @@ static noinline_for_stack void update_is
 	reclaim_stat->recent_scanned[1] += *nr_file;
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1247,6 +1261,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1295,26 +1310,32 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
-
+	nr_reclaimed = shrink_page_list(&page_list, sc,
+					PAGEOUT_IO_ASYNC,
+					&nr_dirty);
 	/*
 	 * If we are direct reclaiming for contiguous pages and we do
 	 * not reclaim everything in the list, try again and wait
 	 * for IO to complete. This will stall high-order allocations
 	 * but that should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+							 PAGEOUT_IO_SYNC,
+							 &nr_dirty);
+		}
 	}
 
 	local_irq_disable();

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
@ 2010-07-21 11:36           ` Johannes Weiner
  0 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-21 11:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote:
> > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> > > I think it would turn out more natural to just return dirty pages on
> > > page_list and have the whole looping logic in shrink_inactive_list().
> > > 
> > > Mixing dirty pages with other 'please try again' pages is probably not
> > > so bad anyway, it means we could retry all temporary unavailable pages
> > > instead of twiddling thumbs over that particular bunch of pages until
> > > the flushers catch up.
> > > 
> > > What do you think?
> > > 
[...]
> > The reason why I did it this way was because of lumpy reclaim and memcg
> > requiring specific pages. I considered lumpy reclaim to be the more common
> > case. In that case, it's removing potentially a large number of pages from
> > the LRU that are contiguous. If some of those are dirty and it selects more
> > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the
> > system even worse than it currently does when the system is under load. Hence,
> > this wait and retry loop is done instead of returning and isolating more pages.
> 
> I think here we missed each other.  I don't want the loop to be _that_
> much more in the outer scope that isolation is repeated as well.  What
> I had in mind is the attached patch.  It is not tested and hacked up
> rather quickly due to time constraints, sorry, but you should get the
> idea.  I hope I did not miss anything fundamental.
> 
> Note that since only kswapd enters pageout() anymore, everything
> depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)

Aaaaand direct reclaimers for swap, of course.  Selfslap.  Here is the
patch again, sans the first hunk (and the type of @dirty_seen fixed):

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -643,12 +643,14 @@ static noinline_for_stack void free_page
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc,
+				      enum pageout_io sync_writeback,
+				      unsigned long *dirty_seen)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -657,7 +659,7 @@ static unsigned long shrink_page_list(st
 		enum page_references references;
 		struct address_space *mapping;
 		struct page *page;
-		int may_enter_fs;
+		int may_pageout;
 
 		cond_resched();
 
@@ -681,10 +683,15 @@ static unsigned long shrink_page_list(st
 		if (page_mapped(page) || PageSwapCache(page))
 			sc->nr_scanned++;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+		/*
+		 * To prevent stack overflows, only kswapd can enter
+		 * the filesystem.  Swap IO is always fine (for now).
+		 */
+		may_pageout = current_is_kswapd() ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
+			int may_wait;
 			/*
 			 * Synchronous reclaim is performed in two passes,
 			 * first an asynchronous pass over the list to
@@ -693,7 +700,8 @@ static unsigned long shrink_page_list(st
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout;
+			if (sync_writeback == PAGEOUT_IO_SYNC && may_wait)
 				wait_on_page_writeback(page);
 			else
 				goto keep_locked;
@@ -719,7 +727,7 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
-			may_enter_fs = 1;
+			may_pageout = 1;
 		}
 
 		mapping = page_mapping(page);
@@ -742,9 +750,11 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty++;
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
-			if (!may_enter_fs)
+			if (!may_pageout)
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;
@@ -860,6 +870,7 @@ keep:
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
+	*dirty_seen = nr_dirty;
 	return nr_reclaimed;
 }
 
@@ -1232,6 +1243,9 @@ static noinline_for_stack void update_is
 	reclaim_stat->recent_scanned[1] += *nr_file;
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1247,6 +1261,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1295,26 +1310,32 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
-
+	nr_reclaimed = shrink_page_list(&page_list, sc,
+					PAGEOUT_IO_ASYNC,
+					&nr_dirty);
 	/*
 	 * If we are direct reclaiming for contiguous pages and we do
 	 * not reclaim everything in the list, try again and wait
 	 * for IO to complete. This will stall high-order allocations
 	 * but that should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+							 PAGEOUT_IO_SYNC,
+							 &nr_dirty);
+		}
 	}
 
 	local_irq_disable();


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-20 22:02         ` Johannes Weiner
@ 2010-07-21 11:52           ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 11:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote:
> > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> > > > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > > >  	pagevec_free(&freed_pvec);
> > > >  }
> > > >  
> > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > > > +#define MAX_SWAP_CLEAN_WAIT 50
> > > > +
> > > >  /*
> > > >   * shrink_page_list() returns the number of reclaimed pages
> > > >   */
> > > > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  					struct scan_control *sc,
> > > >  					enum pageout_io sync_writeback)
> > > >  {
> > > > -	LIST_HEAD(ret_pages);
> > > >  	LIST_HEAD(free_pages);
> > > > -	int pgactivate = 0;
> > > > +	LIST_HEAD(putback_pages);
> > > > +	LIST_HEAD(dirty_pages);
> > > > +	int pgactivate;
> > > > +	int dirty_isolated = 0;
> > > > +	unsigned long nr_dirty;
> > > >  	unsigned long nr_reclaimed = 0;
> > > >  
> > > > +	pgactivate = 0;
> > > >  	cond_resched();
> > > >  
> > > > +restart_dirty:
> > > > +	nr_dirty = 0;
> > > >  	while (!list_empty(page_list)) {
> > > >  		enum page_references references;
> > > >  		struct address_space *mapping;
> > > > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  			}
> > > >  		}
> > > >  
> > > > -		if (PageDirty(page)) {
> > > > +		if (PageDirty(page))  {
> > > > +			/*
> > > > +			 * If the caller cannot writeback pages, dirty pages
> > > > +			 * are put on a separate list for cleaning by either
> > > > +			 * a flusher thread or kswapd
> > > > +			 */
> > > > +			if (!reclaim_can_writeback(sc, page)) {
> > > > +				list_add(&page->lru, &dirty_pages);
> > > > +				unlock_page(page);
> > > > +				nr_dirty++;
> > > > +				goto keep_dirty;
> > > > +			}
> > > > +
> > > >  			if (references == PAGEREF_RECLAIM_CLEAN)
> > > >  				goto keep_locked;
> > > >  			if (!may_enter_fs)
> > > > @@ -852,13 +928,39 @@ activate_locked:
> > > >  keep_locked:
> > > >  		unlock_page(page);
> > > >  keep:
> > > > -		list_add(&page->lru, &ret_pages);
> > > > +		list_add(&page->lru, &putback_pages);
> > > > +keep_dirty:
> > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > >  	}
> > > >  
> > > > +	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> > > > +		/*
> > > > +		 * Wakeup a flusher thread to clean at least as many dirty
> > > > +		 * pages as encountered by direct reclaim. Wait on congestion
> > > > +		 * to throttle processes cleaning dirty pages
> > > > +		 */
> > > > +		wakeup_flusher_threads(nr_dirty);
> > > > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +
> > > > +		/*
> > > > +		 * As lumpy reclaim and memcg targets specific pages, wait on
> > > > +		 * them to be cleaned and try reclaim again.
> > > > +		 */
> > > > +		if (sync_writeback == PAGEOUT_IO_SYNC ||
> > > > +						sc->mem_cgroup != NULL) {
> > > > +			dirty_isolated++;
> > > > +			list_splice(&dirty_pages, page_list);
> > > > +			INIT_LIST_HEAD(&dirty_pages);
> > > > +			goto restart_dirty;
> > > > +		}
> > > > +	}
> > > 
> > > I think it would turn out more natural to just return dirty pages on
> > > page_list and have the whole looping logic in shrink_inactive_list().
> > > 
> > > Mixing dirty pages with other 'please try again' pages is probably not
> > > so bad anyway, it means we could retry all temporary unavailable pages
> > > instead of twiddling thumbs over that particular bunch of pages until
> > > the flushers catch up.
> > > 
> > > What do you think?
> > > 
> > 
> > It's worth considering! It won't be very tidy but it's workable. The reason
> > it is not tidy is that dirty pages and pages that couldn't be paged will be
> > on the same list so the whole lot will need to be recycled. We'd record in
> > scan_control though that there were pages that need to be retried and loop
> > based on that value. That is manageable though.
> 
> Recycling all of them is what I had in mind, yeah.  But...
> 
> > The reason why I did it this way was because of lumpy reclaim and memcg
> > requiring specific pages. I considered lumpy reclaim to be the more common
> > case. In that case, it's removing potentially a large number of pages from
> > the LRU that are contiguous. If some of those are dirty and it selects more
> > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the
> > system even worse than it currently does when the system is under load. Hence,
> > this wait and retry loop is done instead of returning and isolating more pages.
> 
> I think here we missed each other.  I don't want the loop to be _that_
> much more in the outer scope that isolation is repeated as well. 

My bad.

> What
> I had in mind is the attached patch.  It is not tested and hacked up
> rather quickly due to time constraints, sorry, but you should get the
> idea.  I hope I did not miss anything fundamental.
> 
> Note that since only kswapd enters pageout() anymore, everything
> depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)
> 

Anon page writeback can enter pageout. See

static inline bool reclaim_can_writeback(struct scan_control *sc,
                                        struct page *page)
{
        return !page_is_file_cache(page) || current_is_kswapd();
}

So the logic still applies.


> 	Hannes
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -386,21 +386,17 @@ static pageout_t pageout(struct page *pa
>  			ClearPageReclaim(page);
>  			return PAGE_ACTIVATE;
>  		}
> -
> -		/*
> -		 * Wait on writeback if requested to. This happens when
> -		 * direct reclaiming a large contiguous area and the
> -		 * first attempt to free a range of pages fails.
> -		 */
> -		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> -			wait_on_page_writeback(page);
> -

I'm assuming this should still remain because it can apply to anon page
writeback (i.e. being swapped)?

>  		if (!PageWriteback(page)) {
>  			/* synchronous write or broken a_ops? */
>  			ClearPageReclaim(page);
>  		}
>  		trace_mm_vmscan_writepage(page,
>  			page_is_file_cache(page),
> +			/*
> +			 * Humm.  Only kswapd comes here and for
> +			 * kswapd there never is a PAGEOUT_IO_SYNC
> +			 * cycle...
> +			 */
>  			sync_writeback == PAGEOUT_IO_SYNC);
>  		inc_zone_page_state(page, NR_VMSCAN_WRITE);

To clarify, see the following example of writeback stats - the anon sync
I/O in particular

Direct reclaim pages scanned                156940     150720     145472 142254 
Direct reclaim write file async I/O           2472          0          0 0 
Direct reclaim write anon async I/O          29281      27195      27968 25519 
Direct reclaim write file sync I/O            1943          0          0 0 
Direct reclaim write anon sync I/O           11777      12488      10835 4806 

>  		return PAGE_SUCCESS;
> @@ -643,12 +639,14 @@ static noinline_for_stack void free_page
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -					struct scan_control *sc,
> -					enum pageout_io sync_writeback)
> +				      struct scan_control *sc,
> +				      enum pageout_io sync_writeback,
> +				      int *dirty_seen)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
> +	unsigned long nr_dirty = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st
>  		enum page_references references;
>  		struct address_space *mapping;
>  		struct page *page;
> -		int may_enter_fs;
> +		int may_pageout;
>  
>  		cond_resched();
>  
> @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st
>  		if (page_mapped(page) || PageSwapCache(page))
>  			sc->nr_scanned++;
>  
> -		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> +		/*
> +		 * To prevent stack overflows, only kswapd can enter
> +		 * the filesystem.  Swap IO is always fine (for now).
> +		 */
> +		may_pageout = current_is_kswapd() ||
>  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  

We lost the __GFP_FS check and it's vaguely possible kswapd could call the
allocator with GFP_NOFS. While you check it before wait_on_page_writeback, it
needs to be checked before calling pageout(). I toyed around with
creating a may_pageout that took everything into account but I couldn't
convince myself there were no holes or serious change in functionality.
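
For illustration, one possible shape (untested sketch only, not something
I would push as-is, and this is exactly where I struggled to convince
myself there are no holes):

	/*
	 * Sketch, not tested: keep the __GFP_FS requirement for
	 * file-backed pages while still limiting filesystem writeback
	 * to kswapd; swap IO only needs __GFP_IO.
	 */
	may_pageout = (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)) ||
		      (current_is_kswapd() && (sc->gfp_mask & __GFP_FS));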

>  		if (PageWriteback(page)) {
> +			int may_wait;
>  			/*
>  			 * Synchronous reclaim is performed in two passes,
>  			 * first an asynchronous pass over the list to
> @@ -693,7 +696,8 @@ static unsigned long shrink_page_list(st
>  			 * for any page for which writeback has already
>  			 * started.
>  			 */
> -			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
> +			may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout;
> +			if (sync_writeback == PAGEOUT_IO_SYNC && may_wait)
>  				wait_on_page_writeback(page);
>  			else
>  				goto keep_locked;
> @@ -719,7 +723,7 @@ static unsigned long shrink_page_list(st
>  				goto keep_locked;
>  			if (!add_to_swap(page))
>  				goto activate_locked;
> -			may_enter_fs = 1;
> +			may_pageout = 1;
>  		}
>  
>  		mapping = page_mapping(page);
> @@ -742,9 +746,11 @@ static unsigned long shrink_page_list(st
>  		}
>  
>  		if (PageDirty(page)) {
> +			nr_dirty++;
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
> -			if (!may_enter_fs)
> +			if (!may_pageout)
>  				goto keep_locked;
>  			if (!sc->may_writepage)
>  				goto keep_locked;
> @@ -860,6 +866,7 @@ keep:
>  
>  	list_splice(&ret_pages, page_list);
>  	count_vm_events(PGACTIVATE, pgactivate);
> +	*dirty_seen = nr_dirty;
>  	return nr_reclaimed;
>  }
>  
> @@ -1232,6 +1239,9 @@ static noinline_for_stack void update_is
>  	reclaim_stat->recent_scanned[1] += *nr_file;
>  }
>  
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  /*
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
> @@ -1247,6 +1257,7 @@ shrink_inactive_list(unsigned long nr_to
>  	unsigned long nr_active;
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
> +	unsigned long nr_dirty;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1295,26 +1306,34 @@ shrink_inactive_list(unsigned long nr_to
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> -
> +	nr_reclaimed = shrink_page_list(&page_list, sc,
> +					PAGEOUT_IO_ASYNC,
> +					&nr_dirty);
>  	/*
>  	 * If we are direct reclaiming for contiguous pages and we do
>  	 * not reclaim everything in the list, try again and wait
>  	 * for IO to complete. This will stall high-order allocations
>  	 * but that should be acceptable to the caller
>  	 */
> -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> -			sc->lumpy_reclaim_mode) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) {
> +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
>  
> -		/*
> -		 * The attempt at page out may have made some
> -		 * of the pages active, mark them inactive again.
> -		 */
> -		nr_active = clear_active_flags(&page_list, NULL);
> -		count_vm_events(PGDEACTIVATE, nr_active);
> +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +			wakeup_flusher_threads(nr_dirty);
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			/*
> +			 * The attempt at page out may have made some
> +			 * of the pages active, mark them inactive again.
> +			 *
> +			 * Humm.  Still needed?
> +			 */
> +			nr_active = clear_active_flags(&page_list, NULL);
> +			count_vm_events(PGDEACTIVATE, nr_active);
>  

I don't see why it would be removed.

> -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> +			nr_reclaimed += shrink_page_list(&page_list, sc,
> +							 PAGEOUT_IO_SYNC,
> +							 &nr_dirty);
> +		}
>  	}
>  
>  	local_irq_disable();

Ok, is this closer to what you had in mind?

==== CUT HERE ====
[PATCH] vmscan: Do not writeback filesystem pages in direct reclaim

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back filesystem pages by checking
if current is kswapd or the page is anonymous before writing back.  If the
dirty pages cannot be written back, they are placed back on the LRU lists
for either background writing by the BDI threads or kswapd. If in direct
lumpy reclaim and dirty pages are encountered, the process will stall for
the background flusher before trying to reclaim the pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6587155..e3a5816 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,51 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a range of pages fails.
+	 */
+	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		wait_on_page_writeback(page);
+
+	if (!PageWriteback(page)) {
+		/* synchronous write or broken a_ops? */
+		ClearPageReclaim(page);
+	}
+	trace_mm_vmscan_writepage(page,
+		page_is_file_cache(page),
+		sync_writeback == PAGEOUT_IO_SYNC);
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+	return PAGE_SUCCESS;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -367,46 +412,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
-	if (clear_page_dirty_for_io(page)) {
-		int res;
-		struct writeback_control wbc = {
-			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
-			.range_start = 0,
-			.range_end = LLONG_MAX,
-			.nonblocking = 1,
-			.for_reclaim = 1,
-		};
-
-		SetPageReclaim(page);
-		res = mapping->a_ops->writepage(page, &wbc);
-		if (res < 0)
-			handle_write_error(mapping, page, res);
-		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			ClearPageReclaim(page);
-			return PAGE_ACTIVATE;
-		}
-
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
-		if (!PageWriteback(page)) {
-			/* synchronous write or broken a_ops? */
-			ClearPageReclaim(page);
-		}
-		trace_mm_vmscan_writepage(page,
-			page_is_file_cache(page),
-			sync_writeback == PAGEOUT_IO_SYNC);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
-		return PAGE_SUCCESS;
-	}
-
-	return PAGE_CLEAN;
+	return write_reclaim_page(page, mapping, sync_writeback);
 }
 
 /*
@@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 	pagevec_free(&freed_pvec);
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+					enum pageout_io sync_writeback,
+					unsigned long *nr_still_dirty)
 {
-	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
-	int pgactivate = 0;
+	LIST_HEAD(putback_pages);
+	LIST_HEAD(dirty_pages);
+	int pgactivate;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
+	pgactivate = 0;
 	cond_resched();
 
 	while (!list_empty(page_list)) {
@@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (PageDirty(page)) {
+		if (PageDirty(page))  {
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				list_add(&page->lru, &dirty_pages);
+				unlock_page(page);
+				nr_dirty++;
+				goto keep_dirty;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -852,13 +876,19 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		list_add(&page->lru, &ret_pages);
+		list_add(&page->lru, &putback_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	if (nr_dirty) {
+		*nr_still_dirty = nr_dirty;
+		list_splice(&dirty_pages, page_list);
+	}
+	list_splice(&putback_pages, page_list);
+
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
+								&nr_dirty);
 
 	/*
-	 * If we are direct reclaiming for contiguous pages and we do
+	 * If specific pages are needed such as with direct reclaiming
+	 * for contiguous pages or for memory containers and we do
 	 * not reclaim everything in the list, try again and wait
-	 * for IO to complete. This will stall high-order allocations
-	 * but that should be acceptable to the caller
+	 * for IO to complete. This will stall callers that require
+	 * specific pages but it should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (sc->may_writepage && !current_is_kswapd() &&
+			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
+	
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+						PAGEOUT_IO_SYNC, &nr_dirty);
+		}
 	}
 
 	local_irq_disable();


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 11:52           ` Mel Gorman
@ 2010-07-21 12:01             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 177+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-21 12:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, 21 Jul 2010 12:52:50 +0100
Mel Gorman <mel@csn.ul.ie> wrote:


> ==== CUT HERE ====
> [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim
> 
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6587155..e3a5816 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -323,6 +323,51 @@ typedef enum {
>  	PAGE_CLEAN,
>  } pageout_t;
>  
> +int write_reclaim_page(struct page *page, struct address_space *mapping,
> +						enum pageout_io sync_writeback)
> +{
> +	int res;
> +	struct writeback_control wbc = {
> +		.sync_mode = WB_SYNC_NONE,
> +		.nr_to_write = SWAP_CLUSTER_MAX,
> +		.range_start = 0,
> +		.range_end = LLONG_MAX,
> +		.nonblocking = 1,
> +		.for_reclaim = 1,
> +	};
> +
> +	if (!clear_page_dirty_for_io(page))
> +		return PAGE_CLEAN;
> +
> +	SetPageReclaim(page);
> +	res = mapping->a_ops->writepage(page, &wbc);
> +	if (res < 0)
> +		handle_write_error(mapping, page, res);
> +	if (res == AOP_WRITEPAGE_ACTIVATE) {
> +		ClearPageReclaim(page);
> +		return PAGE_ACTIVATE;
> +	}
> +
> +	/*
> +	 * Wait on writeback if requested to. This happens when
> +	 * direct reclaiming a large contiguous area and the
> +	 * first attempt to free a range of pages fails.
> +	 */
> +	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> +		wait_on_page_writeback(page);
> +
> +	if (!PageWriteback(page)) {
> +		/* synchronous write or broken a_ops? */
> +		ClearPageReclaim(page);
> +	}
> +	trace_mm_vmscan_writepage(page,
> +		page_is_file_cache(page),
> +		sync_writeback == PAGEOUT_IO_SYNC);
> +	inc_zone_page_state(page, NR_VMSCAN_WRITE);
> +
> +	return PAGE_SUCCESS;
> +}
> +
>  /*
>   * pageout is called by shrink_page_list() for each dirty page.
>   * Calls ->writepage().
> @@ -367,46 +412,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>  	if (!may_write_to_queue(mapping->backing_dev_info))
>  		return PAGE_KEEP;
>  
> -	if (clear_page_dirty_for_io(page)) {
> -		int res;
> -		struct writeback_control wbc = {
> -			.sync_mode = WB_SYNC_NONE,
> -			.nr_to_write = SWAP_CLUSTER_MAX,
> -			.range_start = 0,
> -			.range_end = LLONG_MAX,
> -			.nonblocking = 1,
> -			.for_reclaim = 1,
> -		};
> -
> -		SetPageReclaim(page);
> -		res = mapping->a_ops->writepage(page, &wbc);
> -		if (res < 0)
> -			handle_write_error(mapping, page, res);
> -		if (res == AOP_WRITEPAGE_ACTIVATE) {
> -			ClearPageReclaim(page);
> -			return PAGE_ACTIVATE;
> -		}
> -
> -		/*
> -		 * Wait on writeback if requested to. This happens when
> -		 * direct reclaiming a large contiguous area and the
> -		 * first attempt to free a range of pages fails.
> -		 */
> -		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> -			wait_on_page_writeback(page);
> -
> -		if (!PageWriteback(page)) {
> -			/* synchronous write or broken a_ops? */
> -			ClearPageReclaim(page);
> -		}
> -		trace_mm_vmscan_writepage(page,
> -			page_is_file_cache(page),
> -			sync_writeback == PAGEOUT_IO_SYNC);
> -		inc_zone_page_state(page, NR_VMSCAN_WRITE);
> -		return PAGE_SUCCESS;
> -	}
> -
> -	return PAGE_CLEAN;
> +	return write_reclaim_page(page, mapping, sync_writeback);
>  }
>  
>  /*
> @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  	pagevec_free(&freed_pvec);
>  }
>  
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
> -					enum pageout_io sync_writeback)
> +					enum pageout_io sync_writeback,
> +					unsigned long *nr_still_dirty)
>  {
> -	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> -	int pgactivate = 0;
> +	LIST_HEAD(putback_pages);
> +	LIST_HEAD(dirty_pages);
> +	int pgactivate;
> +	unsigned long nr_dirty = 0;
>  	unsigned long nr_reclaimed = 0;
>  
> +	pgactivate = 0;
>  	cond_resched();
>  
>  	while (!list_empty(page_list)) {
> @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (PageDirty(page)) {
> +		if (PageDirty(page))  {
> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> +				list_add(&page->lru, &dirty_pages);
> +				unlock_page(page);
> +				nr_dirty++;
> +				goto keep_dirty;
> +			}
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -852,13 +876,19 @@ activate_locked:
>  keep_locked:
>  		unlock_page(page);
>  keep:
> -		list_add(&page->lru, &ret_pages);
> +		list_add(&page->lru, &putback_pages);
> +keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
>  	free_page_list(&free_pages);
>  
> -	list_splice(&ret_pages, page_list);
> +	if (nr_dirty) {
> +		*nr_still_dirty = nr_dirty;
> +		list_splice(&dirty_pages, page_list);
> +	}
> +	list_splice(&putback_pages, page_list);
> +
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
>  }
> @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	unsigned long nr_active;
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
> +	unsigned long nr_dirty;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> +								&nr_dirty);
>  
>  	/*
> -	 * If we are direct reclaiming for contiguous pages and we do
> +	 * If specific pages are needed such as with direct reclaiming
> +	 * for contiguous pages or for memory containers and we do
>  	 * not reclaim everything in the list, try again and wait
> -	 * for IO to complete. This will stall high-order allocations
> -	 * but that should be acceptable to the caller
> +	 * for IO to complete. This will stall callers that require
> +	 * specific pages but it should be acceptable to the caller
>  	 */
> -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> -			sc->lumpy_reclaim_mode) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	if (sc->may_writepage && !current_is_kswapd() &&
> +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;

Hmm, ok. I see what will happen to memcg.
But, hmm, memcg will have to decide whether to enter this routine based on
the result of the 1st memory reclaim.
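
To put it in code, maybe something like this (untested sketch, just to show
the idea; the names are the ones from your patch):

	if (sc->may_writepage && !current_is_kswapd() &&
	    nr_reclaimed < nr_taken && nr_dirty &&
	    (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
		/* ... retry loop unchanged ... */
	}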

>  
> -		/*
> -		 * The attempt at page out may have made some
> -		 * of the pages active, mark them inactive again.
> -		 */
> -		nr_active = clear_active_flags(&page_list, NULL);
> -		count_vm_events(PGDEACTIVATE, nr_active);
> +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
Is congestion_wait required here? Where does the congestion happen?
I'm sorry if you already have some other trick in another patch.

> -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> +			/*
> +			 * The attempt at page out may have made some
> +			 * of the pages active, mark them inactive again.
> +			 */
> +			nr_active = clear_active_flags(&page_list, NULL);
> +			count_vm_events(PGDEACTIVATE, nr_active);
> +	
> +			nr_reclaimed += shrink_page_list(&page_list, sc,
> +						PAGEOUT_IO_SYNC, &nr_dirty);
> +		}

Just a question: does this PAGEOUT_IO_SYNC still have any meaning here?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 11:52           ` Mel Gorman
@ 2010-07-21 13:04             ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-21 13:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 12:52:50PM +0100, Mel Gorman wrote:
> On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> > What
> > I had in mind is the attached patch.  It is not tested and hacked up
> > rather quickly due to time constraints, sorry, but you should get the
> > idea.  I hope I did not miss anything fundamental.
> > 
> > Note that since only kswapd enters pageout() anymore, everything
> > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> > cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)
> > 
> 
> Anon page writeback can enter pageout. See
> 
> static inline bool reclaim_can_writeback(struct scan_control *sc,
>                                         struct page *page)
> {
>         return !page_is_file_cache(page) || current_is_kswapd();
> }
> 
> So the logic still applies.

Yeah, I noticed it only after looking at it again this morning.  My
bad, it got a bit late when I wrote it.

> > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> > -					struct scan_control *sc,
> > -					enum pageout_io sync_writeback)
> > +				      struct scan_control *sc,
> > +				      enum pageout_io sync_writeback,
> > +				      int *dirty_seen)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> > +	unsigned long nr_dirty = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st
> >  		enum page_references references;
> >  		struct address_space *mapping;
> >  		struct page *page;
> > -		int may_enter_fs;
> > +		int may_pageout;
> >  
> >  		cond_resched();
> >  
> > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st
> >  		if (page_mapped(page) || PageSwapCache(page))
> >  			sc->nr_scanned++;
> >  
> > -		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > +		/*
> > +		 * To prevent stack overflows, only kswapd can enter
> > +		 * the filesystem.  Swap IO is always fine (for now).
> > +		 */
> > +		may_pageout = current_is_kswapd() ||
> >  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> >  
> 
> We lost the __GFP_FS check and it's vaguely possible kswapd could call the
> allocator with GFP_NOFS. While you check it before wait_on_page_writeback it
> needs to be checked before calling pageout(). I toyed around with
> creating a may_pageout that took everything into account but I couldn't
> convince myself there was no holes or serious change in functionality.

Yeah, I checked balance_pgdat(), saw GFP_KERNEL and went for it.  But
it's probably better to keep such dependencies out.
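Something like the following is presumably what you mean by a may_pageout
that takes everything into account (a rough, untested sketch only, not from
either posted patch):

		/*
		 * Sketch: keep the __GFP_FS dependency explicit so a GFP_NOFS
		 * reclaim context never reaches ->writepage for file pages,
		 * even from kswapd; swap IO still only needs __GFP_IO.
		 */
		may_pageout = (current_is_kswapd() && (sc->gfp_mask & __GFP_FS)) ||
			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));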

> Ok, is this closer to what you had in mind?

IMHO this is (almost) ready to get merged, so I am including the
nitpicking comments :-)

> ==== CUT HERE ====
> [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim
> 
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6587155..e3a5816 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c

[...]

Does factoring pageout() still make sense in this patch?  It does not
introduce a second callsite.

> @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  	pagevec_free(&freed_pvec);
>  }
>  
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50

That's placed a bit randomly now that shrink_page_list() doesn't use
it anymore.  I moved it just above shrink_inactive_list() but maybe it
would be better at the file's head?

>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
> -					enum pageout_io sync_writeback)
> +					enum pageout_io sync_writeback,
> +					unsigned long *nr_still_dirty)
>  {
> -	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> -	int pgactivate = 0;
> +	LIST_HEAD(putback_pages);
> +	LIST_HEAD(dirty_pages);
> +	int pgactivate;
> +	unsigned long nr_dirty = 0;
>  	unsigned long nr_reclaimed = 0;
>  
> +	pgactivate = 0;

Spurious change?

>  	cond_resched();
>  
>  	while (!list_empty(page_list)) {
> @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (PageDirty(page)) {
> +		if (PageDirty(page))  {

Ha!

> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> +				list_add(&page->lru, &dirty_pages);
> +				unlock_page(page);
> +				nr_dirty++;
> +				goto keep_dirty;
> +			}

I don't understand why you keep the extra dirty list.  Couldn't this
just be `goto keep_locked'?
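Something along these lines is what I have in mind (untested sketch):

			if (page_is_file_cache(page) && !current_is_kswapd()) {
				/* still count it so the caller can wake the flushers */
				nr_dirty++;
				goto keep_locked;
			}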

>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -852,13 +876,19 @@ activate_locked:
>  keep_locked:
>  		unlock_page(page);
>  keep:
> -		list_add(&page->lru, &ret_pages);
> +		list_add(&page->lru, &putback_pages);
> +keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
>  	free_page_list(&free_pages);
>  
> -	list_splice(&ret_pages, page_list);
> +	if (nr_dirty) {
> +		*nr_still_dirty = nr_dirty;

You either have to set *nr_still_dirty unconditionally or
(re)initialize the variable in shrink_inactive_list().
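E.g. either of these would do (sketch only):

	/* in shrink_page_list(), just before returning */
	*nr_still_dirty = nr_dirty;		/* report even when zero */

	/* or in shrink_inactive_list(), at the declaration */
	unsigned long nr_dirty = 0;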

> +		list_splice(&dirty_pages, page_list);
> +	}
> +	list_splice(&putback_pages, page_list);

When we retry those pages, the dirty ones come last on the list.  Was
this maybe the intention behind collecting dirties separately?
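Just to spell out the ordering I mean: both splices insert at the head of
page_list, so the last splice ends up in front (sketch of the effect):

	list_splice(&dirty_pages, page_list);	/* page_list: [dirty] */
	list_splice(&putback_pages, page_list);	/* page_list: [putback][dirty] */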

> @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	unsigned long nr_active;
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
> +	unsigned long nr_dirty;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> +								&nr_dirty);
>  
>  	/*
> -	 * If we are direct reclaiming for contiguous pages and we do
> +	 * If specific pages are needed such as with direct reclaiming
> +	 * for contiguous pages or for memory containers and we do
>  	 * not reclaim everything in the list, try again and wait
> -	 * for IO to complete. This will stall high-order allocations
> -	 * but that should be acceptable to the caller
> +	 * for IO to complete. This will stall callers that require
> +	 * specific pages but it should be acceptable to the caller
>  	 */
> -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> -			sc->lumpy_reclaim_mode) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	if (sc->may_writepage && !current_is_kswapd() &&
> +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
>  
> -		/*
> -		 * The attempt at page out may have made some
> -		 * of the pages active, mark them inactive again.
> -		 */
> -		nr_active = clear_active_flags(&page_list, NULL);
> -		count_vm_events(PGDEACTIVATE, nr_active);
> +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);

Yup, minding laptop_mode (together with may_writepage).  Agreed.

> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
> -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> +			/*
> +			 * The attempt at page out may have made some
> +			 * of the pages active, mark them inactive again.
> +			 */
> +			nr_active = clear_active_flags(&page_list, NULL);
> +			count_vm_events(PGDEACTIVATE, nr_active);
> +	
> +			nr_reclaimed += shrink_page_list(&page_list, sc,
> +						PAGEOUT_IO_SYNC, &nr_dirty);
> +		}
>  	}
>  
>  	local_irq_disable();

Thanks,
	Hannes

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 13:04             ` Johannes Weiner
@ 2010-07-21 13:38               ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 13:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 03:04:35PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 12:52:50PM +0100, Mel Gorman wrote:
> > On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> > > What
> > > I had in mind is the attached patch.  It is not tested and hacked up
> > > rather quickly due to time constraints, sorry, but you should get the
> > > idea.  I hope I did not miss anything fundamental.
> > > 
> > > Note that since only kswapd enters pageout() anymore, everything
> > > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> > > cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)
> > > 
> > 
> > Anon page writeback can enter pageout. See
> > 
> > static inline bool reclaim_can_writeback(struct scan_control *sc,
> >                                         struct page *page)
> > {
> >         return !page_is_file_cache(page) || current_is_kswapd();
> > }
> > 
> > So the logic still applies.
> 
> Yeah, I noticed it only after looking at it again this morning.  My
> bad, it got a bit late when I wrote it.
> 

No worries, in an earlier version anon and file writeback were both
blocked and I suspect that was in the back of your mind somewhere.

> > > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page
> > >   * shrink_page_list() returns the number of reclaimed pages
> > >   */
> > >  static unsigned long shrink_page_list(struct list_head *page_list,
> > > -					struct scan_control *sc,
> > > -					enum pageout_io sync_writeback)
> > > +				      struct scan_control *sc,
> > > +				      enum pageout_io sync_writeback,
> > > +				      int *dirty_seen)
> > >  {
> > >  	LIST_HEAD(ret_pages);
> > >  	LIST_HEAD(free_pages);
> > >  	int pgactivate = 0;
> > > +	unsigned long nr_dirty = 0;
> > >  	unsigned long nr_reclaimed = 0;
> > >  
> > >  	cond_resched();
> > > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st
> > >  		enum page_references references;
> > >  		struct address_space *mapping;
> > >  		struct page *page;
> > > -		int may_enter_fs;
> > > +		int may_pageout;
> > >  
> > >  		cond_resched();
> > >  
> > > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st
> > >  		if (page_mapped(page) || PageSwapCache(page))
> > >  			sc->nr_scanned++;
> > >  
> > > -		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > > +		/*
> > > +		 * To prevent stack overflows, only kswapd can enter
> > > +		 * the filesystem.  Swap IO is always fine (for now).
> > > +		 */
> > > +		may_pageout = current_is_kswapd() ||
> > >  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> > >  
> > 
> > We lost the __GFP_FS check and it's vaguely possible kswapd could call the
> > allocator with GFP_NOFS. While you check it before wait_on_page_writeback it
> > needs to be checked before calling pageout(). I toyed around with
> > creating a may_pageout that took everything into account but I couldn't
> > convince myself there was no holes or serious change in functionality.
> 
> Yeah, I checked balance_pgdat(), saw GFP_KERNEL and went for it.  But
> it's probably better to keep such dependencies out.
> 

Ok.

> > Ok, is this closer to what you had in mind?
> 
> IMHO this is (almost) ready to get merged, so I am including the
> nitpicking comments :-)
> 
> > ==== CUT HERE ====
> > [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim
> > 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6587155..e3a5816 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> 
> [...]
> 
> Does factoring pageout() still make sense in this patch?  It does not
> introduce a second callsite.
> 

It's not necessary anymore and just obscures the patch. I collapsed it.

> > @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >  	pagevec_free(&freed_pvec);
> >  }
> >  
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> 
> That's placed a bit randomly now that shrink_page_list() doesn't use
> it anymore.  I moved it just above shrink_inactive_list() but maybe it
> would be better at the file's head?
> 

I will move it to the top.

> >  /*
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> > -					enum pageout_io sync_writeback)
> > +					enum pageout_io sync_writeback,
> > +					unsigned long *nr_still_dirty)
> >  {
> > -	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> > -	int pgactivate = 0;
> > +	LIST_HEAD(putback_pages);
> > +	LIST_HEAD(dirty_pages);
> > +	int pgactivate;
> > +	unsigned long nr_dirty = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> > +	pgactivate = 0;
> 
> Spurious change?
> 

Yes, it was previously needed for the restart_dirty loop. Now it's just a
stupid leftover.

> >  	cond_resched();
> >  
> >  	while (!list_empty(page_list)) {
> > @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			}
> >  		}
> >  
> > -		if (PageDirty(page)) {
> > +		if (PageDirty(page))  {
> 
> Ha!
> 

:) fixed.

> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				list_add(&page->lru, &dirty_pages);
> > +				unlock_page(page);
> > +				nr_dirty++;
> > +				goto keep_dirty;
> > +			}
> 
> I don't understand why you keep the extra dirty list.  Couldn't this
> just be `goto keep_locked'?
> 

Yep, because we are no longer looping to retry dirty pages.

> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -852,13 +876,19 @@ activate_locked:
> >  keep_locked:
> >  		unlock_page(page);
> >  keep:
> > -		list_add(&page->lru, &ret_pages);
> > +		list_add(&page->lru, &putback_pages);
> > +keep_dirty:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> >  	free_page_list(&free_pages);
> >  
> > -	list_splice(&ret_pages, page_list);
> > +	if (nr_dirty) {
> > +		*nr_still_dirty = nr_dirty;
> 
> You either have to set *nr_still_dirty unconditionally or
> (re)initialize the variable in shrink_inactive_list().
> 

Unconditionally happening now.

> > +		list_splice(&dirty_pages, page_list);
> > +	}
> > +	list_splice(&putback_pages, page_list);
> 
> When we retry those pages, the dirty ones come last on the list.  Was
> this maybe the intention behind collecting dirties separately?
> 

No, the intention was to only recycle dirty pages but it's not very
important.

> > @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  	unsigned long nr_active;
> >  	unsigned long nr_anon;
> >  	unsigned long nr_file;
> > +	unsigned long nr_dirty;
> >  
> >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > +								&nr_dirty);
> >  
> >  	/*
> > -	 * If we are direct reclaiming for contiguous pages and we do
> > +	 * If specific pages are needed such as with direct reclaiming
> > +	 * for contiguous pages or for memory containers and we do
> >  	 * not reclaim everything in the list, try again and wait
> > -	 * for IO to complete. This will stall high-order allocations
> > -	 * but that should be acceptable to the caller
> > +	 * for IO to complete. This will stall callers that require
> > +	 * specific pages but it should be acceptable to the caller
> >  	 */
> > -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > -			sc->lumpy_reclaim_mode) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	if (sc->may_writepage && !current_is_kswapd() &&
> > +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >  
> > -		/*
> > -		 * The attempt at page out may have made some
> > -		 * of the pages active, mark them inactive again.
> > -		 */
> > -		nr_active = clear_active_flags(&page_list, NULL);
> > -		count_vm_events(PGDEACTIVATE, nr_active);
> > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> 
> Yup, minding laptop_mode (together with may_writepage).  Agreed.
> 
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > +			/*
> > +			 * The attempt at page out may have made some
> > +			 * of the pages active, mark them inactive again.
> > +			 */
> > +			nr_active = clear_active_flags(&page_list, NULL);
> > +			count_vm_events(PGDEACTIVATE, nr_active);
> > +	
> > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > +		}
> >  	}
> >  
> >  	local_irq_disable();
> 

Here is an updated version. Thanks very much

==== CUT HERE ====
vmscan: Do not writeback filesystem pages in direct reclaim

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back filesystem pages by checking
if current is kswapd or the page is anonymous before writing back.  If the
dirty pages cannot be written back, they are placed back on the LRU lists
for either background writing by the BDI threads or kswapd. If in direct
lumpy reclaim and dirty pages are encountered, the process will stall for
the background flusher before trying to reclaim the pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6587155..45d9934 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)	(1)
 #endif
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+					enum pageout_io sync_writeback,
+					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				nr_dirty++;
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -858,7 +872,7 @@ keep:
 
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
+								&nr_dirty);
 
 	/*
-	 * If we are direct reclaiming for contiguous pages and we do
+	 * If specific pages are needed such as with direct reclaiming
+	 * for contiguous pages or for memory containers and we do
 	 * not reclaim everything in the list, try again and wait
-	 * for IO to complete. This will stall high-order allocations
-	 * but that should be acceptable to the caller
+	 * for IO to complete. This will stall callers that require
+	 * specific pages but it should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (sc->may_writepage && !current_is_kswapd() &&
+			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
+	
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+						PAGEOUT_IO_SYNC, &nr_dirty);
+		}
 	}
 
 	local_irq_disable();


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
@ 2010-07-21 13:38               ` Mel Gorman
  0 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 13:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 03:04:35PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 12:52:50PM +0100, Mel Gorman wrote:
> > On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote:
> > > What
> > > I had in mind is the attached patch.  It is not tested and hacked up
> > > rather quickly due to time constraints, sorry, but you should get the
> > > idea.  I hope I did not miss anything fundamental.
> > > 
> > > Note that since only kswapd enters pageout() anymore, everything
> > > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync
> > > cycles for kswapd.  Just to mitigate the WTF-count on the patch :-)
> > > 
> > 
> > Anon page writeback can enter pageout. See
> > 
> > static inline bool reclaim_can_writeback(struct scan_control *sc,
> >                                         struct page *page)
> > {
> >         return !page_is_file_cache(page) || current_is_kswapd();
> > }
> > 
> > So the logic still applies.
> 
> Yeah, I noticed it only after looking at it again this morning.  My
> bad, it got a bit late when I wrote it.
> 

No worries, in an earlier version anon and file writeback were both
blocked and I suspect that was in the back of your mind somewhere.

> > > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page
> > >   * shrink_page_list() returns the number of reclaimed pages
> > >   */
> > >  static unsigned long shrink_page_list(struct list_head *page_list,
> > > -					struct scan_control *sc,
> > > -					enum pageout_io sync_writeback)
> > > +				      struct scan_control *sc,
> > > +				      enum pageout_io sync_writeback,
> > > +				      int *dirty_seen)
> > >  {
> > >  	LIST_HEAD(ret_pages);
> > >  	LIST_HEAD(free_pages);
> > >  	int pgactivate = 0;
> > > +	unsigned long nr_dirty = 0;
> > >  	unsigned long nr_reclaimed = 0;
> > >  
> > >  	cond_resched();
> > > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st
> > >  		enum page_references references;
> > >  		struct address_space *mapping;
> > >  		struct page *page;
> > > -		int may_enter_fs;
> > > +		int may_pageout;
> > >  
> > >  		cond_resched();
> > >  
> > > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st
> > >  		if (page_mapped(page) || PageSwapCache(page))
> > >  			sc->nr_scanned++;
> > >  
> > > -		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > > +		/*
> > > +		 * To prevent stack overflows, only kswapd can enter
> > > +		 * the filesystem.  Swap IO is always fine (for now).
> > > +		 */
> > > +		may_pageout = current_is_kswapd() ||
> > >  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> > >  
> > 
> > We lost the __GFP_FS check and it's vaguely possible kswapd could call the
> > allocator with GFP_NOFS. While you check it before wait_on_page_writeback it
> > needs to be checked before calling pageout(). I toyed around with
> > creating a may_pageout that took everything into account but I couldn't
> > convince myself there was no holes or serious change in functionality.
> 
> Yeah, I checked balance_pgdat(), saw GFP_KERNEL and went for it.  But
> it's probably better to keep such dependencies out.
> 

Ok.

> > Ok, is this closer to what you had in mind?
> 
> IMHO this is (almost) ready to get merged, so I am including the
> nitpicking comments :-)
> 
> > ==== CUT HERE ====
> > [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim
> > 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6587155..e3a5816 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> 
> [...]
> 
> Does factoring pageout() still make sense in this patch?  It does not
> introduce a second callsite.
> 

It's not necessary anymore and just obscures the patch. I collapsed it.

> > @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >  	pagevec_free(&freed_pvec);
> >  }
> >  
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> 
> That's placed a bit randomly now that shrink_page_list() doesn't use
> it anymore.  I moved it just above shrink_inactive_list() but maybe it
> would be better at the file's head?
> 

I will move it to the top.

> >  /*
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> > -					enum pageout_io sync_writeback)
> > +					enum pageout_io sync_writeback,
> > +					unsigned long *nr_still_dirty)
> >  {
> > -	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> > -	int pgactivate = 0;
> > +	LIST_HEAD(putback_pages);
> > +	LIST_HEAD(dirty_pages);
> > +	int pgactivate;
> > +	unsigned long nr_dirty = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> > +	pgactivate = 0;
> 
> Spurious change?
> 

Yes, was previously needed for the restart_dirty. Now it's a stupid
change.

> >  	cond_resched();
> >  
> >  	while (!list_empty(page_list)) {
> > @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			}
> >  		}
> >  
> > -		if (PageDirty(page)) {
> > +		if (PageDirty(page))  {
> 
> Ha!
> 

:) fixed.

> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				list_add(&page->lru, &dirty_pages);
> > +				unlock_page(page);
> > +				nr_dirty++;
> > +				goto keep_dirty;
> > +			}
> 
> I don't understand why you keep the extra dirty list.  Couldn't this
> just be `goto keep_locked'?
> 

Yep, because we are no longer looping to retry dirty pages.

> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -852,13 +876,19 @@ activate_locked:
> >  keep_locked:
> >  		unlock_page(page);
> >  keep:
> > -		list_add(&page->lru, &ret_pages);
> > +		list_add(&page->lru, &putback_pages);
> > +keep_dirty:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> >  	free_page_list(&free_pages);
> >  
> > -	list_splice(&ret_pages, page_list);
> > +	if (nr_dirty) {
> > +		*nr_still_dirty = nr_dirty;
> 
> You either have to set *nr_still_dirty unconditionally or
> (re)initialize the variable in shrink_inactive_list().
> 

Unconditionally happening now.

> > +		list_splice(&dirty_pages, page_list);
> > +	}
> > +	list_splice(&putback_pages, page_list);
> 
> When we retry those pages, the dirty ones come last on the list.  Was
> this maybe the intention behind collecting dirties separately?
> 

No, the intention was to only recycle dirty pages but it's not very
important.

> > @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  	unsigned long nr_active;
> >  	unsigned long nr_anon;
> >  	unsigned long nr_file;
> > +	unsigned long nr_dirty;
> >  
> >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > +								&nr_dirty);
> >  
> >  	/*
> > -	 * If we are direct reclaiming for contiguous pages and we do
> > +	 * If specific pages are needed such as with direct reclaiming
> > +	 * for contiguous pages or for memory containers and we do
> >  	 * not reclaim everything in the list, try again and wait
> > -	 * for IO to complete. This will stall high-order allocations
> > -	 * but that should be acceptable to the caller
> > +	 * for IO to complete. This will stall callers that require
> > +	 * specific pages but it should be acceptable to the caller
> >  	 */
> > -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > -			sc->lumpy_reclaim_mode) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	if (sc->may_writepage && !current_is_kswapd() &&
> > +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >  
> > -		/*
> > -		 * The attempt at page out may have made some
> > -		 * of the pages active, mark them inactive again.
> > -		 */
> > -		nr_active = clear_active_flags(&page_list, NULL);
> > -		count_vm_events(PGDEACTIVATE, nr_active);
> > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> 
> Yup, minding laptop_mode (together with may_writepage).  Agreed.
> 
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > +			/*
> > +			 * The attempt at page out may have made some
> > +			 * of the pages active, mark them inactive again.
> > +			 */
> > +			nr_active = clear_active_flags(&page_list, NULL);
> > +			count_vm_events(PGDEACTIVATE, nr_active);
> > +	
> > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > +		}
> >  	}
> >  
> >  	local_irq_disable();
> 

Here is an updated version. Thanks very much

==== CUT HERE ====
vmscan: Do not writeback filesystem pages in direct reclaim

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back filesystem pages by checking
if current is kswapd or the page is anonymous before writing back.  If the
dirty pages cannot be written back, they are placed back on the LRU lists
for either background writing by the BDI threads or kswapd. If in direct
lumpy reclaim and dirty pages are encountered, the process will stall for
the background flusher before trying to reclaim the pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6587155..45d9934 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)	(1)
 #endif
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+					enum pageout_io sync_writeback,
+					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				nr_dirty++;
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -858,7 +872,7 @@ keep:
 
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
+								&nr_dirty);
 
 	/*
-	 * If we are direct reclaiming for contiguous pages and we do
+	 * If specific pages are needed such as with direct reclaiming
+	 * for contiguous pages or for memory containers and we do
 	 * not reclaim everything in the list, try again and wait
-	 * for IO to complete. This will stall high-order allocations
-	 * but that should be acceptable to the caller
+	 * for IO to complete. This will stall callers that require
+	 * specific pages but it should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (sc->may_writepage && !current_is_kswapd() &&
+			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
+	
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+						PAGEOUT_IO_SYNC, &nr_dirty);
+		}
 	}
 
 	local_irq_disable();


^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 12:01             ` KAMEZAWA Hiroyuki
@ 2010-07-21 14:27               ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 14:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote:
> > <SNIP>
> >  
> >  	/*
> > -	 * If we are direct reclaiming for contiguous pages and we do
> > +	 * If specific pages are needed such as with direct reclaiming
> > +	 * for contiguous pages or for memory containers and we do
> >  	 * not reclaim everything in the list, try again and wait
> > -	 * for IO to complete. This will stall high-order allocations
> > -	 * but that should be acceptable to the caller
> > +	 * for IO to complete. This will stall callers that require
> > +	 * specific pages but it should be acceptable to the caller
> >  	 */
> > -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > -			sc->lumpy_reclaim_mode) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	if (sc->may_writepage && !current_is_kswapd() &&
> > +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> 
> Hmm, ok. I see what will happen to memcg.

Thanks

> But, hmm, memcg will have to select to enter this routine based on
> the result of 1st memory reclaim.
> 

It has the option of ignoring pages being dirtied but I worry that the
container could be filled with dirty pages waiting for flushers to do
something.

> >  
> > -		/*
> > -		 * The attempt at page out may have made some
> > -		 * of the pages active, mark them inactive again.
> > -		 */
> > -		nr_active = clear_active_flags(&page_list, NULL);
> > -		count_vm_events(PGDEACTIVATE, nr_active);
> > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
>
> Congestion wait is required ?? Where the congestion happens ?
> I'm sorry you already have some other trick in other patch.
> 

It's to wait for the IO to occur.

> > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > +			/*
> > +			 * The attempt at page out may have made some
> > +			 * of the pages active, mark them inactive again.
> > +			 */
> > +			nr_active = clear_active_flags(&page_list, NULL);
> > +			count_vm_events(PGDEACTIVATE, nr_active);
> > +	
> > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > +		}
> 
> Just a question. This PAGEOUT_IO_SYNC has some meanings ?
> 

Yes, in pageout it will wait on pages currently being written back to be
cleaned before trying to reclaim them.
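
Roughly, the relevant checks look like this (a sketch of the 2.6.35-era
shrink_page_list()/pageout() paths, not an exact quote of the tree):

	/* shrink_page_list(): page already under writeback */
	if (PageWriteback(page)) {
		if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
			wait_on_page_writeback(page);	/* stall until clean */
		else
			goto keep_locked;
	}

	/* pageout(): after ->writepage() has been kicked off */
	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
		wait_on_page_writeback(page);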

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 13:38               ` Mel Gorman
@ 2010-07-21 14:28                 ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-21 14:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> Here is an updated version. Thanks very much
> 
> ==== CUT HERE ====
> vmscan: Do not writeback filesystem pages in direct reclaim
> 
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>

Cool!

Except for one last tiny thing...

> @@ -858,7 +872,7 @@ keep:
>  
>  	free_page_list(&free_pages);
>  
> -	list_splice(&ret_pages, page_list);

This will lose all retry pages forever, I think.

> +	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
>  }

Otherwise,
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 14:28                 ` Johannes Weiner
@ 2010-07-21 14:31                   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 14:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> > Here is an updated version. Thanks very much
> > 
> > ==== CUT HERE ====
> > vmscan: Do not writeback filesystem pages in direct reclaim
> > 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> 
> Cool!
> 
> Except for one last tiny thing...
> 
> > @@ -858,7 +872,7 @@ keep:
> >  
> >  	free_page_list(&free_pages);
> >  
> > -	list_splice(&ret_pages, page_list);
> 
> This will lose all retry pages forever, I think.
> 

Above this is

while (!list_empty(page_list)) {
	...
}

page_list should be empty and keep_locked is putting the pages on ret_pages
already so I think it's ok.

> > +	*nr_still_dirty = nr_dirty;
> >  	count_vm_events(PGACTIVATE, pgactivate);
> >  	return nr_reclaimed;
> >  }
> 
> Otherwise,
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> 

Thanks!

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 14:31                   ` Mel Gorman
@ 2010-07-21 14:39                     ` Johannes Weiner
  -1 siblings, 0 replies; 177+ messages in thread
From: Johannes Weiner @ 2010-07-21 14:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 03:31:19PM +0100, Mel Gorman wrote:
> On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote:
> > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> > > @@ -858,7 +872,7 @@ keep:
> > >  
> > >  	free_page_list(&free_pages);
> > >  
> > > -	list_splice(&ret_pages, page_list);
> > 
> > This will lose all retry pages forever, I think.
> > 
> 
> Above this is
> 
> while (!list_empty(page_list)) {
> 	...
> }
> 
> page_list should be empty and keep_locked is putting the pages on ret_pages
> already so I think it's ok.

But ret_pages is function-local.  Putting them back on the then-empty
page_list is to give them back to the caller, otherwise they are lost
in a dead stack slot.

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 14:39                     ` Johannes Weiner
@ 2010-07-21 15:06                       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-21 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jul 21, 2010 at 04:39:56PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 03:31:19PM +0100, Mel Gorman wrote:
> > On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote:
> > > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> > > > @@ -858,7 +872,7 @@ keep:
> > > >  
> > > >  	free_page_list(&free_pages);
> > > >  
> > > > -	list_splice(&ret_pages, page_list);
> > > 
> > > This will lose all retry pages forever, I think.
> > > 
> > 
> > Above this is
> > 
> > while (!list_empty(page_list)) {
> > 	...
> > }
> > 
> > page_list should be empty and keep_locked is putting the pages on ret_pages
> > already so I think it's ok.
> 
> But ret_pages is function-local.  Putting them back on the then-empty
> page_list is to give them back to the caller, otherwise they are lost
> in a dead stack slot.
> 

Bah, you're right, it is repaired now. /me slaps self. Thanks
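
For the record, the repaired tail presumably ends up looking something like
this (a sketch, not the exact diff):

	list_splice(&ret_pages, page_list);	/* hand retry pages back */
	*nr_still_dirty = nr_dirty;
	count_vm_events(PGACTIVATE, pgactivate);
	return nr_reclaimed;

so the leftover pages go back on the caller's list instead of being dropped
on the function-local ret_pages.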

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 14:27               ` Mel Gorman
@ 2010-07-21 23:57                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 177+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-21 23:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, 21 Jul 2010 15:27:10 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote:
 
> > But, hmm, memcg will have to select to enter this routine based on
> > the result of 1st memory reclaim.
> > 
> 
> > It has the option of ignoring pages being dirtied but I worry that the
> container could be filled with dirty pages waiting for flushers to do
> something.

I'll prepare dirty_ratio for memcg. It's not easy, but it has been requested
by the I/O cgroup guys too...


> 
> > >  
> > > -		/*
> > > -		 * The attempt at page out may have made some
> > > -		 * of the pages active, mark them inactive again.
> > > -		 */
> > > -		nr_active = clear_active_flags(&page_list, NULL);
> > > -		count_vm_events(PGDEACTIVATE, nr_active);
> > > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >  
> >
> > Congestion wait is required ?? Where the congestion happens ?
> > I'm sorry you already have some other trick in other patch.
> > 
> 
> It's to wait for the IO to occur.
> 
1 tick penalty seems too large. I hope we can have some waitqueue in future.



> > > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > +			/*
> > > +			 * The attempt at page out may have made some
> > > +			 * of the pages active, mark them inactive again.
> > > +			 */
> > > +			nr_active = clear_active_flags(&page_list, NULL);
> > > +			count_vm_events(PGDEACTIVATE, nr_active);
> > > +	
> > > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > > +		}
> > 
> > Just a question. This PAGEOUT_IO_SYNC has some meanings ?
> > 
> 
> Yes, in pageout it will wait on pages currently being written back to be
> cleaned before trying to reclaim them.
> 
Hmm. IIUC, this routine is called only when !current_is_kswapd() and
pageout is done only when current_is_kswapd(). So, this seems ....
Wrong ?

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 14:21     ` Christoph Hellwig
@ 2010-07-22  1:13       ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-22  1:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 10:21:45PM +0800, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote:
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > This behavior also makes sense from the perspective of page reclaim.
> > File pages are added to the inactive list and promoted if referenced
> > after one recycling. If not referenced, it's very easy for pages to be
> > cleaned from reclaim context which is inefficient in terms of IO. If
> > background flush is cleaning pages, it's best it cleans old pages to
> > help minimise IO from reclaim.
> 
> Yes, we absolutely do this.  Wu, do you have an improved version of the
> pending or should we put it in this version for now?

Sorry for the delay! The code looks a bit hacky, and there is a problem:
it only decreases expire_interval and never increases/resets it.
So it's possible that when the dirty workload first goes light and then heavy,
expire_interval may be reduced to 0 and never be able to grow again.
In the end we revert to the old behavior of ignoring dirtied_when entirely.

A more complete solution would be to make use of older_than_this not
only for the kupdate case, but also for the background and sync cases.
The policies can be most cleanly carried out in move_expired_inodes().

- kupdate: older_than_this = jiffies - 30s
- background: older_than_this = TRY FROM (jiffies - 30s) TO (jiffies),
                                UNTIL get some inodes to sync
- sync: older_than_this = start time of sync

I'll post an untested RFC patchset for the kupdate and background
cases. The sync case will need two more patch series due to other
problems.
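
To make the background case above concrete, here is a rough sketch of the
idea (not the RFC itself; it assumes the 2.6.35-style
move_expired_inodes(delaying_queue, dispatch_queue, older_than_this) helper
and a made-up 5s relaxation step):

	/*
	 * Background writeback: start from the kupdate cutoff and relax it
	 * towards 'now' until something gets queued for writeout.
	 */
	unsigned long older_than_this = jiffies - msecs_to_jiffies(30 * 1000);

	for (;;) {
		move_expired_inodes(&wb->b_dirty, &wb->b_io, &older_than_this);
		if (!list_empty(&wb->b_io))
			break;
		if (!time_before(older_than_this, jiffies))
			break;			/* nothing expired at all */
		older_than_this += msecs_to_jiffies(5 * 1000);
	}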

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-19 14:40       ` Mel Gorman
@ 2010-07-22  8:52         ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-22  8:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

> Some insight on how the other writeback changes that are being floated
> around might affect the number of dirty pages reclaim encounters would also
> be helpful.

Here is an interesting related problem about the wait_on_page_writeback() call
inside shrink_page_list():

        http://lkml.org/lkml/2010/4/4/86

The problem is, wait_on_page_writeback() is called too early in the
direct reclaim path, which blocks many random/unrelated processes when
some slow (USB stick) writeback is on the way.

A simple dd can easily create a big range of dirty pages in the LRU
list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
typical desktop, which triggers the lumpy reclaim mode and hence
wait_on_page_writeback().

I proposed this patch at the time, which was confirmed to solve the problem:

--- linux-next.orig/mm/vmscan.c	2010-06-24 14:32:03.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-22 16:12:34.000000000 +0800
@@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		sc->lumpy_reclaim_mode = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && priority < DEF_PRIORITY / 2)
 		sc->lumpy_reclaim_mode = 1;
 	else
 		sc->lumpy_reclaim_mode = 0;


However KOSAKI and Minchan raised concerns about raising the bar.
I guess this new patch is more problem-oriented and acceptable:

--- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
@@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
 			count_vm_events(PGDEACTIVATE, nr_active);
 
 			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
+					priority < DEF_PRIORITY / 3 ?
+					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
 		}
 
 		nr_reclaimed += nr_freed;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  8:52         ` Wu Fengguang
@ 2010-07-22  9:02           ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-22  9:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

Sorry, please ignore this hack, it's nonsense.

> 
> --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
>  			count_vm_events(PGDEACTIVATE, nr_active);
>  
>  			nr_freed += shrink_page_list(&page_list, sc,
> -							PAGEOUT_IO_SYNC);
> +					priority < DEF_PRIORITY / 3 ?
> +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
>  		}
>  
>  		nr_reclaimed += nr_freed;
 
Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 23:57                 ` KAMEZAWA Hiroyuki
@ 2010-07-22  9:19                   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-22  9:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Thu, Jul 22, 2010 at 08:57:34AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 21 Jul 2010 15:27:10 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote:
>  
> > > But, hmm, memcg will have to select to enter this routine based on
> > > the result of 1st memory reclaim.
> > > 
> > 
> > It has the option of ignoring pages being dirtied but I worry that the
> > container could be filled with dirty pages waiting for flushers to do
> > something.
> 
> I'll prepare dirty_ratio for memcg. It's not easy but requested by I/O cgroup
> guys, too...
> 

I can see why it might be difficult. Dirty pages are not being counted
on a per-container basis. It would require additional infrastructure to
count it or a lot of scanning.
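
Roughly the shape of the infrastructure that would be needed (purely
hypothetical; none of these identifiers exist in this tree): a counter
bumped at the dirty/clean transitions rather than discovered by scanning
the LRU.

	/* hypothetical per-container dirty accounting */
	struct mem_cgroup_dirty_stat {
		atomic_long_t nr_dirty;
	};

	static inline void memcg_inc_dirty(struct mem_cgroup_dirty_stat *s)
	{
		atomic_long_inc(&s->nr_dirty);	/* e.g. from account_page_dirtied() */
	}

	static inline void memcg_dec_dirty(struct mem_cgroup_dirty_stat *s)
	{
		atomic_long_dec(&s->nr_dirty);	/* e.g. from clear_page_dirty_for_io() */
	}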

> 
> > 
> > > >  
> > > > -		/*
> > > > -		 * The attempt at page out may have made some
> > > > -		 * of the pages active, mark them inactive again.
> > > > -		 */
> > > > -		nr_active = clear_active_flags(&page_list, NULL);
> > > > -		count_vm_events(PGDEACTIVATE, nr_active);
> > > > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > >  
> > >
> > > Congestion wait is required ?? Where the congestion happens ?
> > > I'm sorry you already have some other trick in other patch.
> > > 
> > 
> > It's to wait for the IO to occur.
> > 
>
> 1 tick penalty seems too large. I hope we can have some waitqueue in future.
> 

If congestion occurs, congestion_wait() goes onto a waitqueue that is
woken when congestion clears. I didn't measure it this time around but I
doubt it waits for the full HZ/10 much of the time.
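
If anyone wants to verify that, a quick untested sketch for measuring how
long the call really sleeps:

	unsigned long start = jiffies;

	congestion_wait(BLK_RW_ASYNC, HZ/10);
	trace_printk("congestion_wait slept %u ms\n",
		     jiffies_to_msecs(jiffies - start));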

> > > > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > +			/*
> > > > +			 * The attempt at page out may have made some
> > > > +			 * of the pages active, mark them inactive again.
> > > > +			 */
> > > > +			nr_active = clear_active_flags(&page_list, NULL);
> > > > +			count_vm_events(PGDEACTIVATE, nr_active);
> > > > +	
> > > > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > > > +		}
> > > 
> > > Just a question. This PAGEOUT_IO_SYNC has some meanings ?
> > > 
> > 
> > Yes, in pageout it will wait on pages currently being written back to be
> > cleaned before trying to reclaim them.
> > 
> Hmm. IIUC, this routine is called only when !current_is_kswapd() and
> pageout is done only whne current_is_kswapd(). So, this seems ....
> Wrong ?
> 

Both direct reclaim and kswapd can reach shrink_inactive_list

Direct reclaim
do_try_to_free_pages
  -> shrink_zones
    -> shrink_zone
      -> shrink_list
        -> shrink_inactive list <--- the routine in question

Kswapd
balance_pgdat
  -> shrink_zone
    -> shrink_list
      -> shrink_inactive_list

pageout() is still called by direct reclaim if the page is anon so it
will synchronously wait on those if PAGEOUT_IO_SYNC is set. For either
anon or file pages, if they are currently being written back, they will
be waited on in shrink_page_list() if PAGEOUT_IO_SYNC.

So it still has meaning. Did I miss something?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  8:52         ` Wu Fengguang
@ 2010-07-22  9:21           ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-22  9:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

> I guess this new patch is more problem oriented and acceptable:
> 
> --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
>  			count_vm_events(PGDEACTIVATE, nr_active);
>  
>  			nr_freed += shrink_page_list(&page_list, sc,
> -							PAGEOUT_IO_SYNC);
> +					priority < DEF_PRIORITY / 3 ?
> +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
>  		}
>  
>  		nr_reclaimed += nr_freed;

This one looks better:
---
vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

Fix "system goes totally unresponsive with many dirty/writeback pages"
problem:

	http://lkml.org/lkml/2010/4/4/86

The root cause is, wait_on_page_writeback() is called too early in the
direct reclaim path, which blocks many random/unrelated processes when
some slow (USB stick) writeback is on the way.

A simple dd can easily create a big range of dirty pages in the LRU
list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
typical desktop, which triggers the lumpy reclaim mode and hence
wait_on_page_writeback().

In Andreas' case, 512MB/1024 = 512KB, which is way too low compared to
the 22MB of writeback and 190MB of dirty pages. There can easily be a
continuous range of 512KB dirty/writeback pages in the LRU, which will
trigger the wait logic.

To make it worse, when there are 50MB of writeback pages and USB 1.1 is
writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to 50
seconds.

So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
or 6.25% of the LRU. As the default dirty throttle ratio is 20%, sync write&wait
will hardly be triggered by pure dirty pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
@@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
 		 * but that should be acceptable to the caller
 		 */
 		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
+		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*
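
As a back-of-the-envelope check of the "6.25% of the LRU" figure in the
changelog (DEF_PRIORITY is 12 here; the LRU size below is made up), the
per-priority scan window (lru size >> priority) works out as follows.
This is a standalone illustration, not kernel code:

	#include <stdio.h>

	#define DEF_PRIORITY	12

	int main(void)
	{
		unsigned long lru_pages = 1UL << 20;	/* hypothetical LRU size */
		int priority = DEF_PRIORITY / 3;	/* == 4 */
		unsigned long scan = lru_pages >> priority;

		printf("priority %d scans %lu of %lu pages = %.2f%%\n",
		       priority, scan, lru_pages, 100.0 * scan / lru_pages);
		return 0;	/* prints 6.25% */
	}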

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-22  9:19                   ` Mel Gorman
@ 2010-07-22  9:22                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 177+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-22  9:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Thu, 22 Jul 2010 10:19:30 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Jul 22, 2010 at 08:57:34AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 21 Jul 2010 15:27:10 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:

> > 1 tick penalty seems too large. I hope we can have some waitqueue in future.
> > 
> 
> congestion_wait() if congestion occurs goes onto a waitqueue that is
> woken if congestion clears. I didn't measure it this time around but I
> doubt it waits for HZ/10 much of the time.
> 
Okay.

> > > > > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > > +			/*
> > > > > +			 * The attempt at page out may have made some
> > > > > +			 * of the pages active, mark them inactive again.
> > > > > +			 */
> > > > > +			nr_active = clear_active_flags(&page_list, NULL);
> > > > > +			count_vm_events(PGDEACTIVATE, nr_active);
> > > > > +	
> > > > > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > > +						PAGEOUT_IO_SYNC, &nr_dirty);
> > > > > +		}
> > > > 
> > > > Just a question. This PAGEOUT_IO_SYNC has some meanings ?
> > > > 
> > > 
> > > Yes, in pageout it will wait on pages currently being written back to be
> > > cleaned before trying to reclaim them.
> > > 
> > Hmm. IIUC, this routine is called only when !current_is_kswapd() and
> > pageout is done only when current_is_kswapd(). So, this seems ....
> > Wrong ?
> > 
> 
> Both direct reclaim and kswapd can reach shrink_inactive_list
> 
> Direct reclaim
> do_try_to_free_pages
>   -> shrink_zones
>     -> shrink_zone
>       -> shrink_list
>         -> shrink_inactive list <--- the routine in question
> 
> Kswapd
> balance_pgdat
>   -> shrink_zone
>     -> shrink_list
>       -> shrink_inactive_list
> 
> pageout() is still called by direct reclaim if the page is anon so it
> will synchronously wait on those if PAGEOUT_IO_SYNC is set. 
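
(A stand-alone sketch of the behaviour being described here, with the kernel
details stubbed out -- assumed semantics only, not the actual
shrink_page_list() code:)

#include <stdbool.h>
#include <stdio.h>

enum pageout_io { PAGEOUT_IO_ASYNC, PAGEOUT_IO_SYNC };

/*
 * With PAGEOUT_IO_SYNC the reclaimer waits for writeback to finish and
 * can then reclaim the page; with async IO it skips pages that are
 * still under writeback and leaves them on the LRU for a later pass.
 */
static bool can_reclaim_writeback_page(enum pageout_io sync_writeback,
				       bool may_enter_fs)
{
	if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) {
		/* the kernel would call wait_on_page_writeback() here */
		return true;
	}
	return false;	/* keep the page for a later pass */
}

int main(void)
{
	printf("async pass reclaims a page under writeback: %d\n",
	       can_reclaim_writeback_page(PAGEOUT_IO_ASYNC, true));
	printf("sync pass reclaims a page under writeback:  %d\n",
	       can_reclaim_writeback_page(PAGEOUT_IO_SYNC, true));
	return 0;
}
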

Ah, ok. I missed that. Thank you for the kind clarification.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  8:52         ` Wu Fengguang
@ 2010-07-22  9:42           ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-22  9:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

On Thu, Jul 22, 2010 at 04:52:10PM +0800, Wu Fengguang wrote:
> > Some insight on how the other writeback changes that are being floated
> > around might affect the number of dirty pages reclaim encounters would also
> > be helpful.
> 
> Here is an interesting related problem about the wait_on_page_writeback() call
> inside shrink_page_list():
> 
>         http://lkml.org/lkml/2010/4/4/86
> 
> The problem is, wait_on_page_writeback() is called too early in the
> direct reclaim path, which blocks many random/unrelated processes when
> some slow (USB stick) writeback is on the way.
> 
> A simple dd can easily create a big range of dirty pages in the LRU
> list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> typical desktop, which triggers the lumpy reclaim mode and hence
> wait_on_page_writeback().
> 

Lumpy reclaim is for high-order allocations. A simple dd should not be
triggering it regularly unless there was a lot of forking going on at the
same time. Also, how would a random or unrelated process get blocked on
writeback unless they were also doing high-order allocations?  What was the
source of the high-order allocations?

> I proposed this patch at the time, which was confirmed to solve the problem:
> 
> --- linux-next.orig/mm/vmscan.c	2010-06-24 14:32:03.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:12:34.000000000 +0800
> @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
>  	 */
>  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>  		sc->lumpy_reclaim_mode = 1;
> -	else if (sc->order && priority < DEF_PRIORITY - 2)
> +	else if (sc->order && priority < DEF_PRIORITY / 2)
>  		sc->lumpy_reclaim_mode = 1;
>  	else
>  		sc->lumpy_reclaim_mode = 0;
> 
> 
> However KOSAKI and Minchan raised concerns about raising the bar.
> I guess this new patch is more problem oriented and acceptable:
> 
> --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
>  			count_vm_events(PGDEACTIVATE, nr_active);
>  
>  			nr_freed += shrink_page_list(&page_list, sc,
> -							PAGEOUT_IO_SYNC);
> +					priority < DEF_PRIORITY / 3 ?
> +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
>  		}
>  

I'm not seeing how this helps. It delays when lumpy reclaim waits on IO
to clean contiguous ranges of pages.

I'll read that full thread as I wasn't aware of it before.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  9:21           ` Wu Fengguang
@ 2010-07-22 10:48             ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-22 10:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > I guess this new patch is more problem oriented and acceptable:
> > 
> > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> >  			count_vm_events(PGDEACTIVATE, nr_active);
> >  
> >  			nr_freed += shrink_page_list(&page_list, sc,
> > -							PAGEOUT_IO_SYNC);
> > +					priority < DEF_PRIORITY / 3 ?
> > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> >  		}
> >  
> >  		nr_reclaimed += nr_freed;
> 
> This one looks better:
> ---
> vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> 
> Fix "system goes totally unresponsive with many dirty/writeback pages"
> problem:
> 
> 	http://lkml.org/lkml/2010/4/4/86
> 
> The root cause is, wait_on_page_writeback() is called too early in the
> direct reclaim path, which blocks many random/unrelated processes when
> some slow (USB stick) writeback is on the way.
> 

So, what's the bet if lumpy reclaim is a factor that it's
high-order-but-low-cost such as fork() that are getting caught by this since
[78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
was introduced?

That could manifest to the user as stalls creating new processes when under
heavy IO. I would be surprised it would freeze the entire system but certainly
any new work would feel very slow.

> A simple dd can easily create a big range of dirty pages in the LRU
> list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> typical desktop, which triggers the lumpy reclaim mode and hence
> wait_on_page_writeback().
> 

which triggers the lumpy reclaim mode for high-order allocations.

lumpy reclaim mode is not something that is triggered just because priority
is high.

I think there is a second possibility for causing stalls as well that is
unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
also result in stalls. If it is taking a long time to writeback dirty data,
random processes could be getting stalled just because they happened to dirty
data at the wrong time.  This would be the case if the main dirtying process
(e.g. dd) is not calling sync and dropping pages it's no longer using.

> In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to
> the 22MB writeback and 190MB dirty pages. There can easily be a
> continuous range of 512KB dirty/writeback pages in the LRU, which will
> trigger the wait logic.
> 
> To make it worse, when there are 50MB writeback pages and USB 1.1 is
> writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50
> seconds.
> 
> So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> will hardly be triggered by pure dirty pages.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/vmscan.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
> @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
>  		 * but that should be acceptable to the caller
>  		 */
>  		if (nr_freed < nr_taken && !current_is_kswapd() &&
> -		    sc->lumpy_reclaim_mode) {
> +		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  

This will also delay waiting on congestion for really high-order
allocations such as huge pages, some video decoder and the like which
really should be stalling. How about the following compile-tested diff?
It takes the cost of the high-order allocation into account and the
priority when deciding whether to synchronously wait or not.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..d652e0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
+ * Returns true if the caller should stall on congestion and retry to clean
+ * the list of pages synchronously.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for IO to complete. This
+ * will stall high-order allocations but that should be acceptable to
+ * the caller
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+				unsigned long nr_freed,
+				int priority,
+				struct scan_control *sc)
+{
+	int lumpy_stall_priority;
+
+	/* kswapd should not stall on sync IO */
+	if (current_is_kswapd())
+		return false;
+
+	/* Only stall on lumpy reclaim */
+	if (!sc->lumpy_reclaim_mode)
+		return false;
+
+	/* If we have reclaimed everything on the isolated list, no stall */
+	if (nr_freed == nr_taken)
+		return false;
+
+	/*
+	 * For high-order allocations, there are two stall thresholds.
+	 * High-cost allocations stall immediately whereas lower
+	 * order allocations such as stacks require the scanning
+	 * priority to be much higher before stalling
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		lumpy_stall_priority = DEF_PRIORITY;
+	else
+		lumpy_stall_priority = DEF_PRIORITY / 3;
+
+	return priority <= lumpy_stall_priority;
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
@@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		nr_scanned += nr_scan;
 		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
 
-		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
-		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
+		/* Check if we should synchronously wait for writeback */
+		if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) {
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*
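
For what it's worth, here is a stand-alone user-space sketch of the
two-threshold decision above, to show at which priorities each allocation
order would start to stall (assumed values: DEF_PRIORITY 12,
PAGE_ALLOC_COSTLY_ORDER 3; the kswapd, lumpy-mode and nr_freed checks are
taken as already passed):

#include <stdbool.h>
#include <stdio.h>

#define DEF_PRIORITY		12
#define PAGE_ALLOC_COSTLY_ORDER	3

/* the two-threshold decision from should_reclaim_stall() above */
static bool would_stall(int order, int priority)
{
	int lumpy_stall_priority;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		lumpy_stall_priority = DEF_PRIORITY;	 /* stall right away */
	else
		lumpy_stall_priority = DEF_PRIORITY / 3; /* i.e. priority <= 4 */

	return priority <= lumpy_stall_priority;
}

int main(void)
{
	int priority;

	/* order-1 covers things like fork() stacks, order-9 a huge page */
	for (priority = DEF_PRIORITY; priority >= 0; priority--)
		printf("priority %2d: order-1 stalls=%d, order-9 stalls=%d\n",
		       priority, would_stall(1, priority),
		       would_stall(9, priority));
	return 0;
}

In other words, huge-page-sized requests keep the current behaviour, while
order-1 requests such as fork() only stall once the scanning priority has
dropped to DEF_PRIORITY / 3 or below.
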




^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  9:21           ` Wu Fengguang
@ 2010-07-22 15:34             ` Minchan Kim
  -1 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-22 15:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

Hi, Wu. 
Thanks for Cc'ing me. 

AFAIR, we discussed this by private mail but didn't reach a conclusion. 
Let's start from the beginning. 

On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > I guess this new patch is more problem oriented and acceptable:
> > 
> > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> >  			count_vm_events(PGDEACTIVATE, nr_active);
> >  
> >  			nr_freed += shrink_page_list(&page_list, sc,
> > -							PAGEOUT_IO_SYNC);
> > +					priority < DEF_PRIORITY / 3 ?
> > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> >  		}
> >  
> >  		nr_reclaimed += nr_freed;
> 
> This one looks better:
> ---
> vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> 
> Fix "system goes totally unresponsive with many dirty/writeback pages"
> problem:
> 
> 	http://lkml.org/lkml/2010/4/4/86
> 
> The root cause is, wait_on_page_writeback() is called too early in the
> direct reclaim path, which blocks many random/unrelated processes when
> some slow (USB stick) writeback is on the way.
> 
> A simple dd can easily create a big range of dirty pages in the LRU
> list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> typical desktop, which triggers the lumpy reclaim mode and hence
> wait_on_page_writeback().

I see an OOM message, and its order is zero. 
How does lumpy reclaim come into play here?
For lumpy reclaim to kick in, we have to meet priority < 10 and sc->order > 0.

Please clarify the problem.

> 
> In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to
> the 22MB writeback and 190MB dirty pages. There can easily be a

What are the 22MB and 190MB figures?
It would be better to explain them in more detail. 
I think the description needs to be clear as a summary of the problem 
without requiring the above link. 

Thanks for digging out this problem again. :)
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22  9:42           ` Mel Gorman
@ 2010-07-23  8:33             ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-23  8:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

Hi Mel,

On Thu, Jul 22, 2010 at 05:42:09PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 04:52:10PM +0800, Wu Fengguang wrote:
> > > Some insight on how the other writeback changes that are being floated
> > > around might affect the number of dirty pages reclaim encounters would also
> > > be helpful.
> > 
> > Here is an interesting related problem about the wait_on_page_writeback() call
> > inside shrink_page_list():
> > 
> >         http://lkml.org/lkml/2010/4/4/86

I guess you've got the answers from the above thread; anyway, here are
brief answers to your questions.

> > 
> > The problem is, wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is on the way.
> > 
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
> > 
> 
> Lumpy reclaim is for high-order allocations. A simple dd should not be
> triggering it regularly unless there was a lot of forking going on at the
> same time.

dd can create dirty file pages fast enough that no other process is
injecting pages into the LRU lists besides dd itself. So it is creating
a large range of hard-to-reclaim LRU pages, which will trigger this
code:

+       else if (sc->order && priority < DEF_PRIORITY - 2)
+               lumpy_reclaim = 1;
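
To put rough numbers on the scan window around that threshold, here is a
back-of-envelope user-space sketch (it assumes DEF_PRIORITY is 12 and a
per-pass scan target of roughly lru_size >> priority, as in this kernel):

#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	unsigned long lru_kb = 512UL * 1024;	/* ~512MB of LRU pages */
	int priority;

	/* per-pass scan window around the DEF_PRIORITY - 2 threshold */
	for (priority = DEF_PRIORITY; priority >= DEF_PRIORITY - 2; priority--)
		printf("priority %2d: scan window ~%lu KB\n",
		       priority, lru_kb >> priority);

	/*
	 * Even at priority 10 (DEF_PRIORITY - 2) the window is only ~512KB,
	 * which can easily be covered entirely by the contiguous
	 * dirty/writeback range left behind by dd, so little is reclaimed,
	 * priority keeps dropping and the sc->order check above fires.
	 */
	return 0;
}
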


> Also, how would a random or unrelated process get blocked on
> writeback unless they were also doing high-order allocations?  What was the
> source of the high-order allocations?

sc->order is 1 on fork().

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22 10:48             ` Mel Gorman
@ 2010-07-23  9:45               ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-23  9:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > I guess this new patch is more problem oriented and acceptable:
> > > 
> > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > >  			count_vm_events(PGDEACTIVATE, nr_active);
> > >  
> > >  			nr_freed += shrink_page_list(&page_list, sc,
> > > -							PAGEOUT_IO_SYNC);
> > > +					priority < DEF_PRIORITY / 3 ?
> > > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > >  		}
> > >  
> > >  		nr_reclaimed += nr_freed;
> > 
> > This one looks better:
> > ---
> > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > 
> > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > problem:
> > 
> > 	http://lkml.org/lkml/2010/4/4/86
> > 
> > The root cause is, wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is on the way.
> > 
> 
> So, what's the bet if lumpy reclaim is a factor that it's
> high-order-but-low-cost such as fork() that are getting caught by this since
> [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> was introduced?

Sorry I'm a bit confused by your wording..

> That could manifest to the user as stalls creating new processes when under
> heavy IO. I would be surprised it would freeze the entire system but certainly
> any new work would feel very slow.
> 
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
> > 
> 
> which triggers the lumpy reclaim mode for high-order allocations.

Exactly. Changelog updated.

> lumpy reclaim mode is not something that is triggered just because priority
> is high.

Right.

> I think there is a second possibility for causing stalls as well that is
> unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> also result in stalls. If it is taking a long time to writeback dirty data,
> random processes could be getting stalled just because they happened to dirty
> data at the wrong time.  This would be the case if the main dirtying process
> (e.g. dd) is not calling sync and dropping pages it's no longer using.

The dirty_limit throttling will slow down the dirty process to the
writeback throughput. If a process is dirtying files on sda (HDD),
it will be throttled at 80MB/s. If another process is dirtying files
on sdb (USB 1.1), it will be throttled at 1MB/s.

So dirty throttling will slow things down. However, the slowdown
should be smooth (a series of 100ms stalls instead of a sudden 10s
stall), and won't impact random processes (which do no read/write IO
at all).
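
As a rough comparison in code, using the numbers mentioned in this thread
(assumed figures: ~50MB under writeback, ~1MB/s USB 1.1 throughput, and
throttle pauses on the order of 100ms):

#include <stdio.h>

int main(void)
{
	double writeback_mb = 50.0;	/* pages already under writeback */
	double device_mb_s  = 1.0;	/* USB 1.1 writeback throughput  */

	/* worst case for a reclaimer waiting its way through the range */
	printf("wait_on_page_writeback over the range: up to ~%.0f seconds\n",
	       writeback_mb / device_mb_s);

	/* dirty throttling spreads the same delay over many short pauses */
	printf("dirty-limit throttling: pauses of roughly 0.1 seconds each\n");
	return 0;
}
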

> > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to
> > the 22MB writeback and 190MB dirty pages. There can easily be a
> > continuous range of 512KB dirty/writeback pages in the LRU, which will
> > trigger the wait logic.
> > 
> > To make it worse, when there are 50MB writeback pages and USB 1.1 is
> > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50
> > seconds.
> > 
> > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> > will hardly be triggered by pure dirty pages.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/vmscan.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
> > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> >  		 * but that should be acceptable to the caller
> >  		 */
> >  		if (nr_freed < nr_taken && !current_is_kswapd() &&
> > -		    sc->lumpy_reclaim_mode) {
> > +		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> >  			congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> 
> This will also delay waiting on congestion for really high-order
> allocations such as huge pages, some video decoder and the like which
> really should be stalling.

I absolutely agree that high order allocators should be somehow throttled.

However, given that one can easily create a large _continuous_ range of
dirty LRU pages, letting someone bump all the way through the range
sounds a bit cruel...

> How about the following compile-tested diff?
> It takes the cost of the high-order allocation into account and the
> priority when deciding whether to synchronously wait or not.

Very nice patch. Thanks!

Cheers,
Fengguang

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c7e57c..d652e0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file,
>  }
>  
>  /*
> + * Returns true if the caller should stall on congestion and retry to clean
> + * the list of pages synchronously.
> + *
> + * If we are direct reclaiming for contiguous pages and we do not reclaim
> + * everything in the list, try again and wait for IO to complete. This
> + * will stall high-order allocations but that should be acceptable to
> + * the caller
> + */
> +static inline bool should_reclaim_stall(unsigned long nr_taken,
> +				unsigned long nr_freed,
> +				int priority,
> +				struct scan_control *sc)
> +{
> +	int lumpy_stall_priority;
> +
> +	/* kswapd should not stall on sync IO */
> +	if (current_is_kswapd())
> +		return false;
> +
> +	/* Only stall on lumpy reclaim */
> +	if (!sc->lumpy_reclaim_mode)
> +		return false;
> +
> +	/* If we have reclaimed everything on the isolated list, no stall */
> +	if (nr_freed == nr_taken)
> +		return false;
> +
> +	/*
> +	 * For high-order allocations, there are two stall thresholds.
> +	 * High-cost allocations stall immediately whereas lower
> +	 * order allocations such as stacks require the scanning
> +	 * priority to be much higher before stalling
> +	 */
> +	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> +		lumpy_stall_priority = DEF_PRIORITY;
> +	else
> +		lumpy_stall_priority = DEF_PRIORITY / 3;
> +
> +	return priority <= lumpy_stall_priority;
> +}
> +
> +/*
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
>   */
> @@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>  		nr_scanned += nr_scan;
>  		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
>  
> -		/*
> -		 * If we are direct reclaiming for contiguous pages and we do
> -		 * not reclaim everything in the list, try again and wait
> -		 * for IO to complete. This will stall high-order allocations
> -		 * but that should be acceptable to the caller
> -		 */
> -		if (nr_freed < nr_taken && !current_is_kswapd() &&
> -		    sc->lumpy_reclaim_mode) {
> +		/* Check if we should synchronously wait for writeback */
> +		if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) {
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  			/*
> 
> 

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-23  9:45               ` Wu Fengguang
@ 2010-07-23 10:57                 ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-23 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote:
> On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > > I guess this new patch is more problem oriented and acceptable:
> > > > 
> > > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > >  			count_vm_events(PGDEACTIVATE, nr_active);
> > > >  
> > > >  			nr_freed += shrink_page_list(&page_list, sc,
> > > > -							PAGEOUT_IO_SYNC);
> > > > +					priority < DEF_PRIORITY / 3 ?
> > > > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > >  		}
> > > >  
> > > >  		nr_reclaimed += nr_freed;
> > > 
> > > This one looks better:
> > > ---
> > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > > 
> > > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > > problem:
> > > 
> > > 	http://lkml.org/lkml/2010/4/4/86
> > > 
> > > The root cause is, wait_on_page_writeback() is called too early in the
> > > direct reclaim path, which blocks many random/unrelated processes when
> > > some slow (USB stick) writeback is on the way.
> > > 
> > 
> > So, what's the bet if lumpy reclaim is a factor that it's
> > high-order-but-low-cost such as fork() that are getting caught by this since
> > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> > was introduced?
> 
> Sorry I'm a bit confused by your wording..
> 

After reading the thread, I realised that fork() stalling could be a
factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for
high-order allocations such as those used by fork(). It might have been an
oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily.

> > That could manifest to the user as stalls creating new processes when under
> > heavy IO. I would be surprised it would freeze the entire system but certainly
> > any new work would feel very slow.
> > 
> > > A simple dd can easily create a big range of dirty pages in the LRU
> > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > > typical desktop, which triggers the lumpy reclaim mode and hence
> > > wait_on_page_writeback().
> > > 
> > 
> > which triggers the lumpy reclaim mode for high-order allocations.
> 
> Exactly. Changelog updated.
> 
> > lumpy reclaim mode is not something that is triggered just because priority
> > is high.
> 
> Right.
> 
> > I think there is a second possibility for causing stalls as well that is
> > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> > also result in stalls. If it is taking a long time to writeback dirty data,
> > random processes could be getting stalled just because they happened to dirty
> > data at the wrong time.  This would be the case if the main dirtying process
> > (e.g. dd) is not calling sync and dropping pages it's no longer using.
> 
> The dirty_limit throttling will slow down the dirty process to the
> writeback throughput. If a process is dirtying files on sda (HDD),
> it will be throttled at 80MB/s. If another process is dirtying files
> on sdb (USB 1.1), it will be throttled at 1MB/s.
> 

It will slow down the dirty process doing the dd, but can it also slow
down other processes that just happened to dirty pages at the wrong
time?

> So dirty throttling will slow things down. However the slow down
> should be smooth (a series of 100ms stalls instead of a sudden 10s
> stall), and won't impact random processes (which does no read/write IO
> at all).
> 

Ok.

> > > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to
> > > the 22MB writeback and 190MB dirty pages. There can easily be a
> > > continuous range of 512KB dirty/writeback pages in the LRU, which will
> > > trigger the wait logic.
> > > 
> > > To make it worse, when there are 50MB writeback pages and USB 1.1 is
> > > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50
> > > seconds.
> > > 
> > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> > > will hardly be triggered by pure dirty pages.
> > > 
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/vmscan.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
> > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> > >  		 * but that should be acceptable to the caller
> > >  		 */
> > >  		if (nr_freed < nr_taken && !current_is_kswapd() &&
> > > -		    sc->lumpy_reclaim_mode) {
> > > +		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> > >  			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >  
> > 
> > This will also delay waiting on congestion for really high-order
> > allocations such as huge pages, some video decoder and the like which
> > really should be stalling.
> 
> I absolutely agree that high order allocators should be somehow throttled.
> 
> However, given that one can easily create a large _continuous_ range of
> dirty LRU pages, letting someone bump all the way through the range
> sounds a bit cruel..
> 
> > How about the following compile-tested diff?
> > It takes the cost of the high-order allocation into account and the
> > priority when deciding whether to synchronously wait or not.
> 
> Very nice patch. Thanks!
> 

Will you be picking it up or should I? The changelog should be more or less
the same as yours and consider it

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

It'd be nice if the original tester is still knocking around and willing
to confirm the patch resolves his/her problem. I am running this patch on
my desktop at the moment and it does feel a little smoother but it might be
my imagination. I had trouble with odd stalls that I never pinned down and
was attributing to the machine being commonly heavily loaded but I haven't
noticed them today.

It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
should use PAGEOUT_IO_SYNC]

Thanks

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-23 10:57                 ` Mel Gorman
@ 2010-07-23 11:49                   ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-23 11:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andreas Mohr, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli, Minchan Kim

On Fri, Jul 23, 2010 at 06:57:19PM +0800, Mel Gorman wrote:
> On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote:
> > On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> > > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > > > I guess this new patch is more problem oriented and acceptable:
> > > > > 
> > > > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > > > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > > >  			count_vm_events(PGDEACTIVATE, nr_active);
> > > > >  
> > > > >  			nr_freed += shrink_page_list(&page_list, sc,
> > > > > -							PAGEOUT_IO_SYNC);
> > > > > +					priority < DEF_PRIORITY / 3 ?
> > > > > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > > >  		}
> > > > >  
> > > > >  		nr_reclaimed += nr_freed;
> > > > 
> > > > This one looks better:
> > > > ---
> > > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > > > 
> > > > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > > > problem:
> > > > 
> > > > 	http://lkml.org/lkml/2010/4/4/86
> > > > 
> > > > The root cause is, wait_on_page_writeback() is called too early in the
> > > > direct reclaim path, which blocks many random/unrelated processes when
> > > > some slow (USB stick) writeback is on the way.
> > > > 
> > > 
> > > So, what's the bet if lumpy reclaim is a factor that it's
> > > high-order-but-low-cost such as fork() that are getting caught by this since
> > > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> > > was introduced?
> > 
> > Sorry I'm a bit confused by your wording..
> > 
> 
> After reading the thread, I realised that fork() stalling could be a
> factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for
> high-order allocations such as those used by fork(). It might have been an
> oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily.

That reads much clearer. Thanks! I have the same feeling, hence the
proposed patch.

> > > That could manifest to the user as stalls creating new processes when under
> > > heavy IO. I would be surprised it would freeze the entire system but certainly
> > > any new work would feel very slow.
> > > 
> > > > A simple dd can easily create a big range of dirty pages in the LRU
> > > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > > > typical desktop, which triggers the lumpy reclaim mode and hence
> > > > wait_on_page_writeback().
> > > > 
> > > 
> > > which triggers the lumpy reclaim mode for high-order allocations.
> > 
> > Exactly. Changelog updated.
> > 
> > > lumpy reclaim mode is not something that is triggered just because priority
> > > is high.
> > 
> > Right.
> > 
> > > I think there is a second possibility for causing stalls as well that is
> > > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> > > also result in stalls. If it is taking a long time to writeback dirty data,
> > > random processes could be getting stalled just because they happened to dirty
> > > data at the wrong time.  This would be the case if the main dirtying process
> > > (e.g. dd) is not calling sync and dropping pages it's no longer using.
> > 
> > The dirty_limit throttling will slow down the dirty process to the
> > writeback throughput. If a process is dirtying files on sda (HDD),
> > it will be throttled at 80MB/s. If another process is dirtying files
> > on sdb (USB 1.1), it will be throttled at 1MB/s.
> > 
> 
> It will slow down the dirty process doing the dd, but can it also slow
> down other processes that just happened to dirty pages at the wrong
> time?

For the case of a heavy dirtier (dd) and concurrent light dirtiers
(some random processes), the light dirtiers won't be easily throttled.
task_dirty_limit() handles that case well. It will give light dirtiers a
higher threshold than heavy dirtiers so that only the latter will be
dirty throttled.
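
As a rough illustration of that mechanism (a sketch of the idea only, not
the actual task_dirty_limit() code; the 1/8 margin is just an assumption
for the example):

	/*
	 * Sketch: each task sees a threshold somewhat below the global one,
	 * reduced in proportion to that task's share of recent dirtying, so
	 * a heavy dirtier hits its (lower) limit first while light dirtiers
	 * keep going.
	 */
	static unsigned long task_thresh_sketch(unsigned long global_thresh,
						unsigned long task_share_permille)
	{
		unsigned long margin;

		margin = (global_thresh >> 3) * task_share_permille / 1000;
		return global_thresh - margin;
	}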

> > So dirty throttling will slow things down. However the slow down
> > should be smooth (a series of 100ms stalls instead of a sudden 10s
> > stall), and won't impact random processes (which does no read/write IO
> > at all).
> > 
> 
> Ok.
> 
> > > > In Andreas' case, 512MB/1024 = 512KB, this is way too low compared to
> > > > the 22MB writeback and 190MB dirty pages. There can easily be a
> > > > continuous range of 512KB dirty/writeback pages in the LRU, which will
> > > > trigger the wait logic.
> > > > 
> > > > To make it worse, when there are 50MB writeback pages and USB 1.1 is
> > > > writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to 50
> > > > seconds.
> > > > 
> > > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> > > > will hardly be triggered by pure dirty pages.
> > > > 
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  mm/vmscan.c |    4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > > +++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
> > > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> > > >  		 * but that should be acceptable to the caller
> > > >  		 */
> > > >  		if (nr_freed < nr_taken && !current_is_kswapd() &&
> > > > -		    sc->lumpy_reclaim_mode) {
> > > > +		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> > > >  			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > >  
> > > 
> > > This will also delay waiting on congestion for really high-order
> > > allocations such as huge pages, some video decoder and the like which
> > > really should be stalling.
> > 
> > I absolutely agree that high order allocators should be somehow throttled.

> > However, given that one can easily create a large _continuous_ range of
> > dirty LRU pages, letting someone bump all the way through the range
> > sounds a bit cruel..

Hmm. If such a large range of dirty pages is approaching the end of the LRU,
it means the LRU lists are being scanned pretty fast, indicating a
busy system and/or high memory pressure. So it seems reasonable to act
cruelly to really high-order allocators -- they won't perform well under
memory pressure anyway, and would only make things worse.

> > > How about the following compile-tested diff?
> > > It takes the cost of the high-order allocation into account and the
> > > priority when deciding whether to synchronously wait or not.
> > 
> > Very nice patch. Thanks!
> > 
> 
> Will you be picking it up or should I? The changelog should be more or less
> the same as yours and consider it
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Thanks. I'll post the patch.

> It'd be nice if the original tester is still knocking around and willing
> to confirm the patch resolves his/her problem. I am running this patch on
> my desktop at the moment and it does feel a little smoother but it might be
> my imagination. I had trouble with odd stalls that I never pinned down and
> was attributing to the machine being commonly heavily loaded but I haven't
> noticed them today.

Great. Just added CC to Andreas Mohr.

> It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> should use PAGEOUT_IO_SYNC]

And Minchan, he has been following this issue too :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-22 15:34             ` Minchan Kim
@ 2010-07-23 11:59               ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-23 11:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

Hi Minchan,

On Thu, Jul 22, 2010 at 11:34:40PM +0800, Minchan Kim wrote:
> Hi, Wu. 
> Thanks for CCing me. 
> 
> AFAIR, we discussed this by private mail and didn't conclude yet. 
> Let's start from beginning. 

OK.

> On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > I guess this new patch is more problem oriented and acceptable:
> > > 
> > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > >  			count_vm_events(PGDEACTIVATE, nr_active);
> > >  
> > >  			nr_freed += shrink_page_list(&page_list, sc,
> > > -							PAGEOUT_IO_SYNC);
> > > +					priority < DEF_PRIORITY / 3 ?
> > > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > >  		}
> > >  
> > >  		nr_reclaimed += nr_freed;
> > 
> > This one looks better:
> > ---
> > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > 
> > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > problem:
> > 
> > 	http://lkml.org/lkml/2010/4/4/86
> > 
> > The root cause is, wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is on the way.
> > 
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
> 
> I see oom message. order is zero. 

OOM after applying this patch?  It's not an obvious consequence.

> How does lumpy reclaim work?
> For working lumpy reclaim, we have to meet priority < 10 and sc->order > 0.
>
> Please, clarify the problem.
 
This patch tries to respect the lumpy reclaim logic, and only raises
the bar for sync writeback and IO wait. With Mel's change, it's only
doing so for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations. Hopefully
this will limit unexpected side effects.
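
In other words, the intent is roughly the following (a sketch only -- Mel's
actual diff was snipped):

	/*
	 * Sketch: should shrink_inactive_list() fall back to synchronous
	 * pageout and waiting for this reclaim attempt?
	 */
	static int should_reclaim_sync(struct scan_control *sc, int priority)
	{
		if (!sc->lumpy_reclaim_mode)
			return 0;
		if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
			return 1;	/* costly orders: stall as before */
		return priority < DEF_PRIORITY / 3;	/* low orders: raised bar */
	}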

> > 
> > In Andreas' case, 512MB/1024 = 512KB, this is way too low compared to
> > the 22MB writeback and 190MB dirty pages. There can easily be a
> 
> What's 22MB and 190M?

The numbers are adapted from the OOM dmesg in
http://lkml.org/lkml/2010/4/4/86 . The OOM is order 0 and GFP_KERNEL.

> It would be better to explain it in more detail. 
> I think the description has to be clear as a summary of the problem 
> without the above link. 

Good suggestion. I'll try.

> Thanks for taking out this problem, again. :)

Heh, I'm actually feeling guilty for the long delay!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-23 11:49                   ` Wu Fengguang
@ 2010-07-23 12:20                     ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-23 12:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andreas Mohr, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim

> For the case of a heavy dirtier (dd) and concurrent light dirtiers
> (some random processes), the light dirtiers won't be easily throttled.
> task_dirty_limit() handles that case well. It will give light dirtiers a
> higher threshold than heavy dirtiers so that only the latter will be
> dirty throttled.

The caveat is that the real dirty throttling threshold is not exactly the
value specified by vm.dirty_ratio or vm.dirty_bytes. Instead it's a
value slightly lower than that, and the real value differs for each process,
which is a nice trick to throttle heavy dirtiers first. If I remember
right, that was invented by Peter and Andrew.
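
As a rough worked example (numbers purely illustrative): with 512MB of
memory and dirty_ratio=20%, the nominal limit is ~100MB; a task responsible
for nearly all of the recent dirtying may start to be throttled somewhere
around 85-90MB, while a task that only dirtied a handful of pages still
sees close to the full ~100MB, so the heavy dirtier gets throttled first.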

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-23 10:57                 ` Mel Gorman
@ 2010-07-25 10:43                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 177+ messages in thread
From: KOSAKI Motohiro @ 2010-07-25 10:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Wu Fengguang, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Minchan Kim

Hi

sorry for the delay.

> Will you be picking it up or should I? The changelog should be more or less
> the same as yours and consider it
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> It'd be nice if the original tester is still knocking around and willing
> to confirm the patch resolves his/her problem. I am running this patch on
> my desktop at the moment and it does feel a little smoother but it might be
> my imagination. I had trouble with odd stalls that I never pinned down and
> was attributing to the machine being commonly heavily loaded but I haven't
> noticed them today.
> 
> It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> should use PAGEOUT_IO_SYNC]

My review didn't find any bugs. However, I think the original thread has too much
guesswork, and we need a way to reproduce the problem and confirm it.

At least, we need three confirmations.
 o Is the original issue still there?
 o Is DEF_PRIORITY/3 the best value?
 o Does the current approach perform better than Wu's original proposal? (below)


Anyway, please feel free to use my reviewed-by tag.

Thanks.



--- linux-next.orig/mm/vmscan.c	2010-06-24 14:32:03.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-22 16:12:34.000000000 +0800
@@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		sc->lumpy_reclaim_mode = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && priority < DEF_PRIORITY / 2)
 		sc->lumpy_reclaim_mode = 1;
 	else
 		sc->lumpy_reclaim_mode = 0;
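
(For scale, with DEF_PRIORITY = 12: DEF_PRIORITY - 2 = 10, DEF_PRIORITY / 2 = 6
and DEF_PRIORITY / 3 = 4; by the 1/2^priority rule of thumb used in the
changelog above, those priorities correspond to scan windows of roughly
1/1024, 1/64 and 1/16 (6.25%) of the LRU respectively.)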


^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-25 10:43                   ` KOSAKI Motohiro
@ 2010-07-25 12:03                     ` Minchan Kim
  -1 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-25 12:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Wu Fengguang, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> sorry for the delay.
> 
> > Will you be picking it up or should I? The changelog should be more or less
> > the same as yours and consider it
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > It'd be nice if the original tester is still knocking around and willing
> > to confirm the patch resolves his/her problem. I am running this patch on
> > my desktop at the moment and it does feel a little smoother but it might be
> > my imagination. I had trouble with odd stalls that I never pinned down and
> > was attributing to the machine being commonly heavily loaded but I haven't
> > noticed them today.
> > 
> > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > should use PAGEOUT_IO_SYNC]
> 
> My review didn't find any bugs. However, I think the original thread has too much
> guesswork, and we need a way to reproduce the problem and confirm it.
> 
> At least, we need three confirmations.
>  o Is the original issue still there?
>  o Is DEF_PRIORITY/3 the best value?

I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
I guess the system has 512M and 22M writeback pages,
so you may have chosen it to skip at most 32M of writeback pages.
Is that right?

And I have a question about your comment below.

"As the default dirty throttle ratio is 20%, sync write&wait
will hardly be triggered by pure dirty pages"

I am not sure exactly what you mean, but at least DEF_PRIORITY/3 seems to be
related to dirty_ratio. It can always be changed by the admin.
Then do we have to determine the magic value (DEF_PRIORITY/3) proportionally to dirty_ratio?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-25 10:43                   ` KOSAKI Motohiro
@ 2010-07-26  3:08                     ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  3:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Minchan Kim

KOSAKI,

> My review didn't find any bugs. However, I think the original thread has too much
> guesswork, and we need a way to reproduce the problem and confirm it.
> 
> At least, we need three confirmations.
>  o Is the original issue still there?

As long as the root cause is still there :)

>  o Is DEF_PRIORITY/3 the best value?

There is no best value. I suspect the whole PAGEOUT_IO_SYNC and
wait_on_page_writeback() approach is a terrible workaround and should
be avoided as much as possible. This is why I lifted the bar from
DEF_PRIORITY/2 to DEF_PRIORITY/3.

wait_on_page_writeback() is bad because, for a typical desktop, one
single call may block for 1-10 seconds (remember we are under memory
pressure, which is almost always accompanied by busy disk IO, so
the page will wait a noticeable time in the IO queue). To make it worse,
it is very possible there are 10 more dirty/writeback pages among the
isolated pages (dirty pages are often clustered). This ends up with
10-100 seconds of stall time.

We do need some throttling under memory pressure. However, stall times of
more than 1s are not acceptable. A simple congestion_wait() may be
better, since it waits on _any_ IO completion (which will likely
release a set of PG_reclaim pages) rather than one specific IO
completion. This makes for much smoother stall times.
wait_on_page_writeback() should really be the last resort.
DEF_PRIORITY/3 means 1/16=6.25%, which is closer to that.
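
(The arithmetic, for reference: DEF_PRIORITY is 12, so DEF_PRIORITY/3 = 4,
and a reclaim pass at priority 4 scans roughly nr_lru_pages >> 4 of each
list, hence 1/16 = 6.25%; the earlier DEF_PRIORITY/2 bar corresponds to
priority 6, i.e. only about 1/64.)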

Since dirty/writeback pages are such a bad factor under memory
pressure, it may be worth adaptively shrinking dirty_limit as well.
When short on memory, why not reduce the dirty/writeback page cache?
This would not only relieve the memory pressure, but also considerably
improve IO efficiency and responsiveness. When the LRU lists are scanned
fast (under memory pressure), it is likely that lots of the dirty pages
will be caught by pageout(). Reducing the number of dirty pages reduces
the pageout() invocations.
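
Purely as an illustration of that idea (hypothetical -- no such knob exists),
the threshold could be scaled down as reclaim priority drops:

	/*
	 * Hypothetical sketch: shave the dirty threshold while reclaim is
	 * struggling, so that fewer dirty pages reach the tail of the LRU.
	 */
	static unsigned long pressure_scaled_dirty_thresh(unsigned long thresh,
							  int priority)
	{
		if (priority < DEF_PRIORITY)	/* some pressure already seen */
			thresh >>= (DEF_PRIORITY - priority) / 3;
		return thresh;
	}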

>  o Does the current approach perform better than Wu's original proposal? (below)

I guess it will have better user experience :)

> Anyway, please feel free to use my reviewed-by tag.
 
Thanks,
Fengguang

> --- linux-next.orig/mm/vmscan.c	2010-06-24 14:32:03.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:12:34.000000000 +0800
> @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
>  	 */
>  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>  		sc->lumpy_reclaim_mode = 1;
> -	else if (sc->order && priority < DEF_PRIORITY - 2)
> +	else if (sc->order && priority < DEF_PRIORITY / 2)
>  		sc->lumpy_reclaim_mode = 1;
>  	else
>  		sc->lumpy_reclaim_mode = 0;

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26  3:08                     ` Wu Fengguang
@ 2010-07-26  3:11                       ` Rik van Riel
  -1 siblings, 0 replies; 177+ messages in thread
From: Rik van Riel @ 2010-07-26  3:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Minchan Kim

On 07/25/2010 11:08 PM, Wu Fengguang wrote:

> We do need some throttling under memory pressure. However stall time
> more than 1s is not acceptable. A simple congestion_wait() may be
> better, since it waits on _any_ IO completion (which will likely
> release a set of PG_reclaim pages) rather than one specific IO
> completion. This makes much smoother stall time.
> wait_on_page_writeback() shall really be the last resort.
> DEF_PRIORITY/3 means 1/16=6.25%, which is closer.

I agree with the max 1 second stall time, but 6.25% of
memory could be an awful lot of pages to scan on a system
with 1TB of memory :)

Not sure what the best approach is, just pointing out
that DEF_PRIORITY/3 may be too much for large systems...
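
(For scale: 6.25% of 1TB is 64GB, i.e. roughly 16 million 4KB pages.)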

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26  3:11                       ` Rik van Riel
@ 2010-07-26  3:17                         ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  3:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli, Minchan Kim

On Mon, Jul 26, 2010 at 11:11:37AM +0800, Rik van Riel wrote:
> On 07/25/2010 11:08 PM, Wu Fengguang wrote:
> 
> > We do need some throttling under memory pressure. However stall time
> > more than 1s is not acceptable. A simple congestion_wait() may be
> > better, since it waits on _any_ IO completion (which will likely
> > release a set of PG_reclaim pages) rather than one specific IO
> > completion. This makes much smoother stall time.
> > wait_on_page_writeback() shall really be the last resort.
> > DEF_PRIORITY/3 means 1/16=6.25%, which is closer.
> 
> I agree with the max 1 second stall time, but 6.25% of
> memory could be an awful lot of pages to scan on a system
> with 1TB of memory :)

I admittedly left 1TB systems out of this topic, because in such
systems, <PAGE_ALLOC_COSTLY_ORDER pages are easily available? :)

> Not sure what the best approach is, just pointing out
> that DEF_PRIORITY/3 may be too much for large systems...

What if DEF_PRIORITY/3 is used under PAGE_ALLOC_COSTLY_ORDER?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-25 12:03                     ` Minchan Kim
@ 2010-07-26  3:27                       ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  3:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > sorry for the delay.
> > 
> > > Will you be picking it up or should I? The changelog should be more or less
> > > the same as yours and consider it
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > 
> > > It'd be nice if the original tester is still knocking around and willing
> > > to confirm the patch resolves his/her problem. I am running this patch on
> > > my desktop at the moment and it does feel a little smoother but it might be
> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > > was attributing to the machine being commonly heavily loaded but I haven't
> > > noticed them today.
> > > 
> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > > should use PAGEOUT_IO_SYNC]
> > 
> > My reviewing doesn't found any bug. however I think original thread have too many guess
> > and we need to know reproduce way and confirm it.
> > 
> > At least, we need three confirms.
> >  o original issue is still there?
> >  o DEF_PRIORITY/3 is best value?
> 
> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
> I guess system has 512M and 22M writeback pages. 
> So you may determine it for skipping max 32M writeback pages.
> Is right?

For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
pages. shrink_inactive_list() first calls
shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
converted to writeback pages and then optionally be waited on.

The dirty/writeback pages may go up to 512M*20% = ~100M, so 32M looks
like a reasonable value.
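
To put concrete numbers on the above, a tiny stand-alone calculation
(it only assumes mainline's DEF_PRIORITY of 12 and the default 20%
dirty_ratio; it is not kernel code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long mem_mb = 512;
		int def_priority = 12;			/* DEF_PRIORITY in mm/vmscan.c */
		int threshold_prio = def_priority / 3;	/* = 4 */

		/* a scan pass at priority p covers roughly lru_size >> p pages,
		 * so DEF_PRIORITY/3 corresponds to 1/16 of the LRU */
		printf("fraction:    1/%d = %.2f%%\n",
		       1 << threshold_prio, 100.0 / (1 << threshold_prio));
		printf("threshold:   %luM\n", mem_mb >> threshold_prio);
		printf("dirty limit: %luM\n", mem_mb * 20 / 100);
		return 0;
	}

This prints 6.25%, 32M and 102M (the ~100M above), matching the figures
in the paragraph.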

> And I have a question of your below comment. 
> 
> "As the default dirty throttle ratio is 20%, sync write&wait
> will hardly be triggered by pure dirty pages"
> 
> I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be
> related to dirty_ratio. It always can be changed by admin.
> Then do we have to determine magic value(DEF_PRIORITY/3)  proportional to dirty_ratio?

Yes, DEF_PRIORITY/3 is already proportional to the _default_
dirty_ratio. We could do an explicit comparison with dirty_ratio just
in case dirty_ratio gets changed by the user. It's mainly a question
of whether that deserves the extra overhead and complexity. I'd prefer
to keep the current simple form :)
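
For reference, a purely hypothetical sketch of what such an explicit
comparison could look like (vm_dirty_ratio is the real sysctl behind
/proc/sys/vm/dirty_ratio, but the helper name, its placement and the
exact ratio are invented here):

	/*
	 * Sketch only: derive the "last resort" threshold from the live
	 * dirty_ratio instead of hard-coding DEF_PRIORITY/3, which only
	 * mirrors the default 20%.
	 */
	static bool scan_window_covers_dirty_budget(unsigned long lru_pages,
						    int priority)
	{
		unsigned long window = lru_pages >> priority;	/* pages this pass scans */
		unsigned long budget = lru_pages * vm_dirty_ratio / 100;

		/* trigger once the window reaches roughly 1/3 of the
		 * dirty-able pages, mirroring 6.25% versus 20% */
		return window * 3 >= budget;
	}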

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26  3:27                       ` Wu Fengguang
@ 2010-07-26  4:11                         ` Minchan Kim
  -1 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-26  4:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
>> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
>> > Hi
>> >
>> > sorry for the delay.
>> >
>> > > Will you be picking it up or should I? The changelog should be more or less
>> > > the same as yours and consider it
>> > >
>> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > >
>> > > It'd be nice if the original tester is still knocking around and willing
>> > > to confirm the patch resolves his/her problem. I am running this patch on
>> > > my desktop at the moment and it does feel a little smoother but it might be
>> > > my imagination. I had trouble with odd stalls that I never pinned down and
>> > > was attributing to the machine being commonly heavily loaded but I haven't
>> > > noticed them today.
>> > >
>> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
>> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
>> > > should use PAGEOUT_IO_SYNC]
>> >
>> > My reviewing doesn't found any bug. however I think original thread have too many guess
>> > and we need to know reproduce way and confirm it.
>> >
>> > At least, we need three confirms.
>> >  o original issue is still there?
>> >  o DEF_PRIORITY/3 is best value?
>>
>> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
>> I guess system has 512M and 22M writeback pages.
>> So you may determine it for skipping max 32M writeback pages.
>> Is right?
>
> For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
> pages. shrink_inactive_list() first calls
> shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
> shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> converted to writeback pages and then optionally be waited on.
>
> The dirty/writeback pages may go up to 512M*20% = ~100M, so 32M looks
> like a reasonable value.

Why do you think it's a reasonable value?
I mean, why not 12.5% or 3.125%? Why did you select 6.25%?
I am not against you; this is just out of curiosity and it needs more
explanation. It might be something _only I_ don't know. :(

>
>> And I have a question of your below comment.
>>
>> "As the default dirty throttle ratio is 20%, sync write&wait
>> will hardly be triggered by pure dirty pages"
>>
>> I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be
>> related to dirty_ratio. It always can be changed by admin.
>> Then do we have to determine magic value(DEF_PRIORITY/3)  proportional to dirty_ratio?
>
> Yes, DEF_PRIORITY/3 is already proportional to the _default_
> dirty_ratio. We could do an explicit comparison with dirty_ratio just
> in case dirty_ratio gets changed by the user. It's mainly a question
> of whether that deserves the extra overhead and complexity. I'd prefer
> to keep the current simple form :)

What I am suggesting is: couldn't we use a
recent_writeback/recent_scanned ratio? I think a new field in
scan_control and the counting wouldn't add much overhead or
complexity. I am not sure which ratio is best, but at least it would
make the logic scalable and make sense to me. :)
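
Very roughly, something like this (the field and helper names are
invented; only scan_control and its nr_scanned field exist today):

	/*
	 * Sketch of the idea: count the pages found under writeback during
	 * the scan and act on the ratio, rather than on a fixed
	 * DEF_PRIORITY/3 cut-off.
	 */
	struct scan_control {
		/* ... existing fields ... */
		unsigned long nr_scanned;
		unsigned long nr_writeback;	/* new: PageWriteback() pages seen */
	};

	static bool writeback_ratio_too_high(struct scan_control *sc)
	{
		if (!sc->nr_scanned)
			return false;
		/* e.g. wait synchronously once more than 1/16 of the
		 * scanned pages were already under writeback */
		return sc->nr_writeback * 16 > sc->nr_scanned;
	}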

>
> Thanks,
> Fengguang
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26  4:11                         ` Minchan Kim
  (?)
@ 2010-07-26  4:37                           ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  4:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> >> > Hi
> >> >
> >> > sorry for the delay.
> >> >
> >> > > Will you be picking it up or should I? The changelog should be more or less
> >> > > the same as yours and consider it
> >> > >
> >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > >
> >> > > It'd be nice if the original tester is still knocking around and willing
> >> > > to confirm the patch resolves his/her problem. I am running this patch on
> >> > > my desktop at the moment and it does feel a little smoother but it might be
> >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> >> > > was attributing to the machine being commonly heavily loaded but I haven't
> >> > > noticed them today.
> >> > >
> >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> >> > > should use PAGEOUT_IO_SYNC]
> >> >
> >> > My reviewing doesn't found any bug. however I think original thread have too many guess
> >> > and we need to know reproduce way and confirm it.
> >> >
> >> > At least, we need three confirms.
> >> >  o original issue is still there?
> >> >  o DEF_PRIORITY/3 is best value?
> >>
> >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
> >> I guess system has 512M and 22M writeback pages.
> >> So you may determine it for skipping max 32M writeback pages.
> >> Is right?
> >
> > For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
> > pages. shrink_inactive_list() first calls
> > shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
> > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > converted to writeback pages and then optionally be waited on.
> >
> > The dirty/writeback pages may go up to 512M*20% = ~100M, so 32M looks
> > like a reasonable value.
> 
> Why do you think it's a reasonable value?
> I mean, why not 12.5% or 3.125%? Why did you select 6.25%?
> I am not against you; this is just out of curiosity and it needs more
> explanation. It might be something _only I_ don't know. :(

It's more or less randomly selected. I'm also OK with 3.125%. It's a
threshold to turn on a _last resort_ mechanism, so it doesn't need to
be optimal..

> >
> >> And I have a question of your below comment.
> >>
> >> "As the default dirty throttle ratio is 20%, sync write&wait
> >> will hardly be triggered by pure dirty pages"
> >>
> >> I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be
> >> related to dirty_ratio. It always can be changed by admin.
> >> Then do we have to determine magic value(DEF_PRIORITY/3)  proportional to dirty_ratio?
> >
> > Yes, DEF_PRIORITY/3 is already proportional to the _default_
> > dirty_ratio. We could do an explicit comparison with dirty_ratio just
> > in case dirty_ratio gets changed by the user. It's mainly a question
> > of whether that deserves the extra overhead and complexity. I'd prefer
> > to keep the current simple form :)
> 
> What I am suggesting is: couldn't we use a
> recent_writeback/recent_scanned ratio? I think a new field in
> scan_control and the counting wouldn't add much overhead or
> complexity. I am not sure which ratio is best, but at least it would
> make the logic scalable and make sense to me. :)

..and it doesn't need to be elaborate :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-19 13:11   ` Mel Gorman
@ 2010-07-26  7:28     ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  7:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote:
> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
>   o When dirtying pages, processes may be throttled to clean pages if
>     dirty_ratio is not met.
>   o Pages belonging to inodes dirtied longer than
>     dirty_writeback_centisecs get cleaned.
> 
> The problem for reclaim is that dirty pages can reach the end of the LRU
> if pages are being dirtied slowly so that neither the throttling cleans
> them or a flusher thread waking periodically.
> 
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 1.5 times the
> number of dirty pages encountered by reclaimers. The reasoning is that pages
> were being dirtied at a roughly constant rate recently so if N dirty pages
> were encountered in this scan block, we are likely to see roughly N dirty
> pages again soon so try keep the flusher threads ahead of reclaim.
> 
> This is unfortunately very hand-wavy but there is not really a good way of
> quantifying how bad it is when reclaim encounters dirty pages other than
> "down with that sort of thing". Similarly, there is not an obvious way of
> figuring how what percentage of dirty pages are old in terms of LRU-age and
> should be cleaned. Ideally, the background flushers would only be cleaning
> pages belonging to the zone being scanned but it's not clear if this would
> be of benefit (less IO) or not (potentially less efficient IO if an inode
> is scattered across multiple zones).
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   18 +++++++++++-------
>  1 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc50937..5763719 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -806,6 +806,8 @@ restart_dirty:
>  		}
>  
>  		if (PageDirty(page))  {
> +			nr_dirty++;
> +
>  			/*
>  			 * If the caller cannot writeback pages, dirty pages
>  			 * are put on a separate list for cleaning by either
> @@ -814,7 +816,6 @@ restart_dirty:
>  			if (!reclaim_can_writeback(sc, page)) {
>  				list_add(&page->lru, &dirty_pages);
>  				unlock_page(page);
> -				nr_dirty++;
>  				goto keep_dirty;
>  			}
>  
> @@ -933,13 +934,16 @@ keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though
> +	 * the dirty_ratio may be satisified. In this case, wake
> +	 * flusher threads to pro-actively clean some pages
> +	 */
> +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);

Ah, it's very possible that nr_dirty==0 here! In that case you are asking
the flusher to write the dirty pages all the way down to 0, whether or
not pageout() is called.

Another minor issue: the passed (nr_dirty + nr_dirty / 2) is normally a
small number, much smaller than MAX_WRITEBACK_PAGES. The flusher will
sync at least MAX_WRITEBACK_PAGES pages anyway, which is good for
efficiency. It also seems good to let the flusher write much more than
nr_dirty pages, to safeguard a reasonably large margin between the
vmscan head and the first dirty page on the LRU. So it would be enough
to update the comment.
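
For illustration, a hypothetical variant covering both points (sketch
only; MAX_WRITEBACK_PAGES currently lives in fs/fs-writeback.c, so it is
used here just to show the intent):

	if (nr_dirty) {
		/* never ask the flusher for less than a full writeback
		 * batch, and do not wake it at all when no dirty pages
		 * were encountered on this scan */
		unsigned long nr_to_clean = max_t(unsigned long,
						  nr_dirty + nr_dirty / 2,
						  MAX_WRITEBACK_PAGES);

		wakeup_flusher_threads(laptop_mode ? 0 : nr_to_clean);
	}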

Thanks,
Fengguang

>  	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> -		/*
> -		 * Wakeup a flusher thread to clean at least as many dirty
> -		 * pages as encountered by direct reclaim. Wait on congestion
> -		 * to throttle processes cleaning dirty pages
> -		 */
> -		wakeup_flusher_threads(nr_dirty);
> +		/* Throttle direct reclaimers cleaning pages */
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-21 13:38               ` Mel Gorman
@ 2010-07-26  8:29                 ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26  8:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

> ==== CUT HERE ====
> vmscan: Do not writeback filesystem pages in direct reclaim
> 
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.

This is also a good step towards reducing pageout() calls. For better
IO performance the flusher threads should take more work from pageout().

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 39 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6587155..45d9934 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #define scanning_global_lru(sc)        (1)
>  #endif
> 
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
>                                                   struct scan_control *sc)
>  {
> @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>                                         struct scan_control *sc,
> -                                       enum pageout_io sync_writeback)
> +                                       enum pageout_io sync_writeback,
> +                                       unsigned long *nr_still_dirty)
>  {
>         LIST_HEAD(ret_pages);
>         LIST_HEAD(free_pages);
>         int pgactivate = 0;
> +       unsigned long nr_dirty = 0;
>         unsigned long nr_reclaimed = 0;
> 
>         cond_resched();
> @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                 }
> 
>                 if (PageDirty(page)) {
> +                       /*
> +                        * Only kswapd can writeback filesystem pages to
> +                        * avoid risk of stack overflow
> +                        */
> +                       if (page_is_file_cache(page) && !current_is_kswapd()) {
> +                               nr_dirty++;
> +                               goto keep_locked;
> +                       }
> +
>                         if (references == PAGEREF_RECLAIM_CLEAN)
>                                 goto keep_locked;
>                         if (!may_enter_fs)
> @@ -858,7 +872,7 @@ keep:
> 
>         free_page_list(&free_pages);
> 
> -       list_splice(&ret_pages, page_list);
> +       *nr_still_dirty = nr_dirty;
>         count_vm_events(PGACTIVATE, pgactivate);
>         return nr_reclaimed;
>  }
> @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>         unsigned long nr_active;
>         unsigned long nr_anon;
>         unsigned long nr_file;
> +       unsigned long nr_dirty;
> 
>         while (unlikely(too_many_isolated(zone, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> 
>         spin_unlock_irq(&zone->lru_lock);
> 
> -       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> +                                                               &nr_dirty);
> 
>         /*
> -        * If we are direct reclaiming for contiguous pages and we do
> +        * If specific pages are needed such as with direct reclaiming
> +        * for contiguous pages or for memory containers and we do
>          * not reclaim everything in the list, try again and wait
> -        * for IO to complete. This will stall high-order allocations
> -        * but that should be acceptable to the caller
> +        * for IO to complete. This will stall callers that require
> +        * specific pages but it should be acceptable to the caller
>          */
> -       if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> -                       sc->lumpy_reclaim_mode) {
> -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> +       if (sc->may_writepage && !current_is_kswapd() &&
> +                       (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> +               int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> 
> -               /*
> -                * The attempt at page out may have made some
> -                * of the pages active, mark them inactive again.
> -                */
> -               nr_active = clear_active_flags(&page_list, NULL);
> -               count_vm_events(PGDEACTIVATE, nr_active);
> +               while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +                       wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> +                       congestion_wait(BLK_RW_ASYNC, HZ/10);

It needs good luck for the flusher threads to "happen to" sync the
dirty pages in our page_list. I'd rather take the logic as "there are
too many dirty pages, shrink them to avoid some future pageout() calls
and/or congestion_wait() stalls".

So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times.  Let's remove it?

> -               nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> +                       /*
> +                        * The attempt at page out may have made some
> +                        * of the pages active, mark them inactive again.
> +                        */
> +                       nr_active = clear_active_flags(&page_list, NULL);
> +                       count_vm_events(PGDEACTIVATE, nr_active);
> +
> +                       nr_reclaimed += shrink_page_list(&page_list, sc,
> +                                               PAGEOUT_IO_SYNC, &nr_dirty);

This shrink_page_list() won't be called at all if nr_dirty==0 and
pageout() was called. This is a change of behavior. It can also be
fixed by removing the loop.
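
Roughly, with the loop removed it could collapse to something like this
(sketch only, untested, reusing the symbols from the hunk above):

	if (sc->may_writepage && !current_is_kswapd() &&
	    (sc->lumpy_reclaim_mode || sc->mem_cgroup) &&
	    nr_reclaimed < nr_taken) {
		/* kick the flusher and throttle once, then do a single
		 * synchronous pass over whatever is left on the list */
		wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/*
		 * The attempt at page out may have made some
		 * of the pages active, mark them inactive again.
		 */
		nr_active = clear_active_flags(&page_list, NULL);
		count_vm_events(PGDEACTIVATE, nr_active);

		nr_reclaimed += shrink_page_list(&page_list, sc,
						 PAGEOUT_IO_SYNC, &nr_dirty);
	}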

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-26  8:29                 ` Wu Fengguang
@ 2010-07-26  9:12                   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-26  9:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > ==== CUT HERE ====
> > vmscan: Do not writeback filesystem pages in direct reclaim
> > 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> 
> This is also a good step towards reducing pageout() calls. For better
> IO performance the flusher threads should take more work from pageout().
> 

This is true for better IO performance all right, but reclaim does
require specific pages to be cleaned. The strict requirement is when
lumpy reclaim is involved; a looser requirement is that any pages
within a particular zone be cleaned.

> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/vmscan.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
> >  1 files changed, 39 insertions(+), 16 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6587155..45d9934 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #define scanning_global_lru(sc)        (1)
> >  #endif
> > 
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> >                                                   struct scan_control *sc)
> >  {
> > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >                                         struct scan_control *sc,
> > -                                       enum pageout_io sync_writeback)
> > +                                       enum pageout_io sync_writeback,
> > +                                       unsigned long *nr_still_dirty)
> >  {
> >         LIST_HEAD(ret_pages);
> >         LIST_HEAD(free_pages);
> >         int pgactivate = 0;
> > +       unsigned long nr_dirty = 0;
> >         unsigned long nr_reclaimed = 0;
> > 
> >         cond_resched();
> > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >                 }
> > 
> >                 if (PageDirty(page)) {
> > +                       /*
> > +                        * Only kswapd can writeback filesystem pages to
> > +                        * avoid risk of stack overflow
> > +                        */
> > +                       if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +                               nr_dirty++;
> > +                               goto keep_locked;
> > +                       }
> > +
> >                         if (references == PAGEREF_RECLAIM_CLEAN)
> >                                 goto keep_locked;
> >                         if (!may_enter_fs)
> > @@ -858,7 +872,7 @@ keep:
> > 
> >         free_page_list(&free_pages);
> > 
> > -       list_splice(&ret_pages, page_list);
> > +       *nr_still_dirty = nr_dirty;
> >         count_vm_events(PGACTIVATE, pgactivate);
> >         return nr_reclaimed;
> >  }
> > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >         unsigned long nr_active;
> >         unsigned long nr_anon;
> >         unsigned long nr_file;
> > +       unsigned long nr_dirty;
> > 
> >         while (unlikely(too_many_isolated(zone, file, sc))) {
> >                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > 
> >         spin_unlock_irq(&zone->lru_lock);
> > 
> > -       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > +       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > +                                                               &nr_dirty);
> > 
> >         /*
> > -        * If we are direct reclaiming for contiguous pages and we do
> > +        * If specific pages are needed such as with direct reclaiming
> > +        * for contiguous pages or for memory containers and we do
> >          * not reclaim everything in the list, try again and wait
> > -        * for IO to complete. This will stall high-order allocations
> > -        * but that should be acceptable to the caller
> > +        * for IO to complete. This will stall callers that require
> > +        * specific pages but it should be acceptable to the caller
> >          */
> > -       if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > -                       sc->lumpy_reclaim_mode) {
> > -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +       if (sc->may_writepage && !current_is_kswapd() &&
> > +                       (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > +               int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > 
> > -               /*
> > -                * The attempt at page out may have made some
> > -                * of the pages active, mark them inactive again.
> > -                */
> > -               nr_active = clear_active_flags(&page_list, NULL);
> > -               count_vm_events(PGDEACTIVATE, nr_active);
> > +               while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +                       wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > +                       congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> It needs good luck for the flusher threads to "happen to" sync the
> dirty pages in our page_list.

That is why I'm expecting the "shrink oldest inode" patchset to help. It
still requires a certain amount of luck but callers that encounter dirty
pages will be delayed.

It's also because a certain amount of luck is required that the last patch
in the series aims to reduce the number of dirty pages encountered by
reclaim. The closer that number is to 0, the less important the timing of
the flusher threads is.

> I'd rather take the logic as "there are
> too many dirty pages, shrink them to avoid some future pageout() calls
> and/or congestion_wait() stalls".
> 

What do you mean by shrink them? They cannot be reclaimed until they are
clean.

> So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times.  Let's remove it?
> 

This loop only applies to direct reclaimers in lumpy reclaim mode and
memory containers. Both need specific pages to be cleaned and freed.
Hence, the loop is to stall them and wait on flusher threads up to a
point. Otherwise they can cause a reclaim storm of clean pages that
can't be used.

Current tests have not indicated that MAX_SWAP_CLEAN_WAIT is regularly reached,
but I am inferring this from timing data rather than from a direct measurement.
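
For reference, a minimal userspace sketch of the arithmetic behind the
"up to 5 seconds" comment in the patch, assuming each
congestion_wait(BLK_RW_ASYNC, HZ/10) sleeps its full ~100ms timeout and
no dirty page is cleaned in the meantime:

#include <stdio.h>

#define MAX_SWAP_CLEAN_WAIT 50    /* retries, as defined in the patch */
#define CONGESTION_WAIT_MS  100   /* HZ/10 expressed in milliseconds */

int main(void)
{
        /* Worst-case stall a lumpy/memcg direct reclaimer can see per
         * shrink_inactive_list() call if the flusher never catches up. */
        printf("worst-case stall: %d ms\n",
               MAX_SWAP_CLEAN_WAIT * CONGESTION_WAIT_MS); /* 5000 ms */
        return 0;
}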

> > -               nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > +                       /*
> > +                        * The attempt at page out may have made some
> > +                        * of the pages active, mark them inactive again.
> > +                        */
> > +                       nr_active = clear_active_flags(&page_list, NULL);
> > +                       count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > +                       nr_reclaimed += shrink_page_list(&page_list, sc,
> > +                                               PAGEOUT_IO_SYNC, &nr_dirty);
> 
> This shrink_page_list() won't be called at all if nr_dirty==0 and
> pageout() was called. This is a change of behavior. It can also be
> fixed by removing the loop.
> 

The whole patch is a change of behaviour but in this case it also makes
sense to focus on just the dirty pages. The first shrink_page_list
decided that the pages could not be unmapped and reclaimed - probably
because they were referenced. This is not likely to change during the loop.

Testing with a version of the patch that processed the full list added
significant stalls when sync writeback was involved. Test time tripled
in one case, implying that this loop was continually reaching
MAX_SWAP_CLEAN_WAIT.

The intention of this loop is "wait on dirty pages to be cleaned". It is
a change of behaviour, but one that makes sense, and testing indicates
it's a good idea.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-26  7:28     ` Wu Fengguang
@ 2010-07-26  9:26       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-26  9:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 03:28:32PM +0800, Wu Fengguang wrote:
> On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote:
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> >   o When dirtying pages, processes may be throttled to clean pages if
> >     dirty_ratio is not met.
> >   o Pages belonging to inodes dirtied longer than
> >     dirty_writeback_centisecs get cleaned.
> > 
> > The problem for reclaim is that dirty pages can reach the end of the LRU
> > if pages are being dirtied slowly so that neither the throttling cleans
> > them or a flusher thread waking periodically.
> > 
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 1.5 times the
> > number of dirty pages encountered by reclaimers. The reasoning is that pages
> > were being dirtied at a roughly constant rate recently so if N dirty pages
> > were encountered in this scan block, we are likely to see roughly N dirty
> > pages again soon so try keep the flusher threads ahead of reclaim.
> > 
> > This is unfortunately very hand-wavy but there is not really a good way of
> > quantifying how bad it is when reclaim encounters dirty pages other than
> > "down with that sort of thing". Similarly, there is not an obvious way of
> > figuring how what percentage of dirty pages are old in terms of LRU-age and
> > should be cleaned. Ideally, the background flushers would only be cleaning
> > pages belonging to the zone being scanned but it's not clear if this would
> > be of benefit (less IO) or not (potentially less efficient IO if an inode
> > is scattered across multiple zones).
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   18 +++++++++++-------
> >  1 files changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc50937..5763719 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -806,6 +806,8 @@ restart_dirty:
> >  		}
> >  
> >  		if (PageDirty(page))  {
> > +			nr_dirty++;
> > +
> >  			/*
> >  			 * If the caller cannot writeback pages, dirty pages
> >  			 * are put on a separate list for cleaning by either
> > @@ -814,7 +816,6 @@ restart_dirty:
> >  			if (!reclaim_can_writeback(sc, page)) {
> >  				list_add(&page->lru, &dirty_pages);
> >  				unlock_page(page);
> > -				nr_dirty++;
> >  				goto keep_dirty;
> >  			}
> >  
> > @@ -933,13 +934,16 @@ keep_dirty:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> > +	/*
> > +	 * If reclaim is encountering dirty pages, it may be because
> > +	 * dirty pages are reaching the end of the LRU even though
> > +	 * the dirty_ratio may be satisified. In this case, wake
> > +	 * flusher threads to pro-actively clean some pages
> > +	 */
> > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> 
> Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> number of dirty pages down to 0 whether or not pageout() is called.
> 

True, this has been fixed to only wake up the flusher threads when this is
the file LRU, dirty pages have been encountered and the caller has
sc->may_writepage set.
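
As a minimal sketch of that guard (stand-in types only; the struct and
field below are assumptions for illustration, not the real scan_control):

#include <stdbool.h>

struct scan_control_stub {
        bool may_writepage;
};

/* Wake the flusher threads only for the file LRU, only when dirty pages
 * were actually encountered, and only if the caller may trigger writeback. */
bool should_wake_flushers(bool file_lru, unsigned long nr_dirty,
                          const struct scan_control_stub *sc)
{
        return file_lru && nr_dirty > 0 && sc->may_writepage;
}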

> Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> for efficiency.
> And it seems good to let the flusher write much more
> than nr_dirty pages to safeguard a reasonable large
> vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> update the comments.
> 

Ok, the reasoning had been to flush a number of pages that was related
to the scanning rate but, if that is inefficient for the flusher, I'll
use MAX_WRITEBACK_PAGES.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-26  9:12                   ` Mel Gorman
@ 2010-07-26 11:19                     ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > > ==== CUT HERE ====
> > > vmscan: Do not writeback filesystem pages in direct reclaim
> > > 
> > > When memory is under enough pressure, a process may enter direct
> > > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > > encountered during the scan, this page is written to backing storage using
> > > mapping->writepage. This can result in very deep call stacks, particularly
> > > if the target storage or filesystem are complex. It has already been observed
> > > on XFS that the stack overflows but the problem is not XFS-specific.
> > > 
> > > This patch prevents direct reclaim writing back filesystem pages by checking
> > > if current is kswapd or the page is anonymous before writing back.  If the
> > > dirty pages cannot be written back, they are placed back on the LRU lists
> > > for either background writing by the BDI threads or kswapd. If in direct
> > > lumpy reclaim and dirty pages are encountered, the process will stall for
> > > the background flusher before trying to reclaim the pages again.
> > > 
> > > As the call-chain for writing anonymous pages is not expected to be deep
> > > and they are not cleaned by flusher threads, anonymous pages are still
> > > written back in direct reclaim.
> > 
> > This is also a good step towards reducing pageout() calls. For better
> > IO performance the flusher threads should take more work from pageout().
> > 
> 
> This is true for better IO performance all right but reclaim does require
> specific pages cleaned. The strict requirement is when lumpy reclaim is
> involved but a looser requirement is when any pages within a zone be cleaned.

Good point, I missed the lumpy reclaim requirement. It seems necessary
to add a call to the flusher thread to write back a specific inode range
(the one that contains the current dirty page). This is a more reliable
way to ensure both the strict and looser requirements: the current dirty
page is guaranteed to be synced, and the inode has a good chance of
containing more dirty pages in the zone, which can be freed quickly if
tagged PG_reclaim.
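
As a rough sketch of the shape of that request (nothing below is an
existing kernel interface; the struct and helper are invented purely for
illustration):

/* Hypothetical request reclaim could hand to the flusher thread. */
struct inode_range_request {
        unsigned long ino;          /* inode owning the dirty page */
        unsigned long start_index;  /* first page offset to write back */
        unsigned long nr_pages;     /* size of the window around it */
};

/* Centre the window on the dirty page reclaim just skipped, so the page
 * itself is synced and neighbouring dirty pages in the same inode get a
 * chance to be cleaned (and tagged PG_reclaim). */
struct inode_range_request make_request(unsigned long ino,
                                        unsigned long index,
                                        unsigned long window)
{
        struct inode_range_request req = {
                .ino = ino,
                .start_index = index > window / 2 ? index - window / 2 : 0,
                .nr_pages = window,
        };
        return req;
}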

> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > >  mm/vmscan.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
> > >  1 files changed, 39 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 6587155..45d9934 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > >  #define scanning_global_lru(sc)        (1)
> > >  #endif
> > > 
> > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > > +#define MAX_SWAP_CLEAN_WAIT 50
> > > +
> > >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > >                                                   struct scan_control *sc)
> > >  {
> > > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > >   */
> > >  static unsigned long shrink_page_list(struct list_head *page_list,
> > >                                         struct scan_control *sc,
> > > -                                       enum pageout_io sync_writeback)
> > > +                                       enum pageout_io sync_writeback,
> > > +                                       unsigned long *nr_still_dirty)
> > >  {
> > >         LIST_HEAD(ret_pages);
> > >         LIST_HEAD(free_pages);
> > >         int pgactivate = 0;
> > > +       unsigned long nr_dirty = 0;
> > >         unsigned long nr_reclaimed = 0;
> > > 
> > >         cond_resched();
> > > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >                 }
> > > 
> > >                 if (PageDirty(page)) {
> > > +                       /*
> > > +                        * Only kswapd can writeback filesystem pages to
> > > +                        * avoid risk of stack overflow
> > > +                        */
> > > +                       if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > +                               nr_dirty++;
> > > +                               goto keep_locked;
> > > +                       }
> > > +
> > >                         if (references == PAGEREF_RECLAIM_CLEAN)
> > >                                 goto keep_locked;
> > >                         if (!may_enter_fs)
> > > @@ -858,7 +872,7 @@ keep:
> > > 
> > >         free_page_list(&free_pages);
> > > 
> > > -       list_splice(&ret_pages, page_list);
> > > +       *nr_still_dirty = nr_dirty;
> > >         count_vm_events(PGACTIVATE, pgactivate);
> > >         return nr_reclaimed;
> > >  }
> > > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > >         unsigned long nr_active;
> > >         unsigned long nr_anon;
> > >         unsigned long nr_file;
> > > +       unsigned long nr_dirty;
> > > 
> > >         while (unlikely(too_many_isolated(zone, file, sc))) {
> > >                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > > 
> > >         spin_unlock_irq(&zone->lru_lock);
> > > 
> > > -       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > > +       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > > +                                                               &nr_dirty);
> > > 
> > >         /*
> > > -        * If we are direct reclaiming for contiguous pages and we do
> > > +        * If specific pages are needed such as with direct reclaiming
> > > +        * for contiguous pages or for memory containers and we do
> > >          * not reclaim everything in the list, try again and wait
> > > -        * for IO to complete. This will stall high-order allocations
> > > -        * but that should be acceptable to the caller
> > > +        * for IO to complete. This will stall callers that require
> > > +        * specific pages but it should be acceptable to the caller
> > >          */
> > > -       if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > > -                       sc->lumpy_reclaim_mode) {
> > > -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +       if (sc->may_writepage && !current_is_kswapd() &&
> > > +                       (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > > +               int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > 
> > > -               /*
> > > -                * The attempt at page out may have made some
> > > -                * of the pages active, mark them inactive again.
> > > -                */
> > > -               nr_active = clear_active_flags(&page_list, NULL);
> > > -               count_vm_events(PGDEACTIVATE, nr_active);
> > > +               while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > +                       wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > +                       congestion_wait(BLK_RW_ASYNC, HZ/10);
> > 
> > It needs good luck for the flusher threads to "happen to" sync the
> > dirty pages in our page_list.
> 
> That is why I'm expecting the "shrink oldest inode" patchset to help. It
> still requires a certain amount of luck but callers that encounter dirty
> pages will be delayed.
> 
> It's also because a certain amount of luck is required that the last patch
> in the series aims at reducing the number of dirty pages encountered by
> reclaim. The closer that is to 0, the less important the timing of flusher
> threads is.

OK.

> > I'd rather take the logic as "there are
> > too many dirty pages, shrink them to avoid some future pageout() calls
> > and/or congestion_wait() stalls".
> > 
> 
> What do you mean by shrink them? They cannot be reclaimed until they are
> clean.

I mean we are freeing much more than nr_dirty pages. In this sense we
are shrinking the number of dirty pages. Note that although we are calling
wakeup_flusher_threads(nr_dirty), the number of pages actually synced will
be much larger than nr_dirty, which is reasonably good behavior.

> > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times.  Let's remove it?
> > 
> 
> This loop only applies to direct reclaimers in lumpy reclaim mode and
> memory containers. Both need specific pages to be cleaned and freed.
> Hence, the loop is to stall them and wait on flusher threads up to a
> point. Otherwise they can cause a reclaim storm of clean pages that
> can't be used.

Agreed. We could call the flusher to sync the inode explicitly, as
recommended above. This will clean and free the page (tagged PG_reclaim)
within seconds. With reasonable waits here we may avoid a reclaim storm
effectively.

> Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached
> but I am inferring this from timing data rather than a direct measurement.
> 
> > > -               nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > +                       /*
> > > +                        * The attempt at page out may have made some
> > > +                        * of the pages active, mark them inactive again.
> > > +                        */
> > > +                       nr_active = clear_active_flags(&page_list, NULL);
> > > +                       count_vm_events(PGDEACTIVATE, nr_active);
> > > +
> > > +                       nr_reclaimed += shrink_page_list(&page_list, sc,
> > > +                                               PAGEOUT_IO_SYNC, &nr_dirty);
> > 
> > This shrink_page_list() won't be called at all if nr_dirty==0 and
> > pageout() was called. This is a change of behavior. It can also be
> > fixed by removing the loop.
> > 
> 
> The whole patch is a change of behaviour but in this case it also makes
> sense to focus on just the dirty pages. The first shrink_page_list
> decided that the pages could not be unmapped and reclaimed - probably
> because it was referenced. This is not likely to change during the loop.

Agreed.

> Testing with a version of the patch that processed the full list added
> significant stalls when sync writeback was involved. Testing time length
> was tripled in one case implying that this loop was continually reaching
> MAX_SWAP_CLEAN_WAIT.

I'm OK with the change, actually; it removes one not-that-user-friendly
wait_on_page_writeback().

> The intention of this loop is "wait on dirty pages to be cleaned" and
> it's a change of behaviour, but one that makes sense and testing
> indicates it's a good idea.

I mean, this loop may be unwound, and we may need another loop to
sync the inodes that contain the dirty pages.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-26  9:26       ` Mel Gorman
@ 2010-07-26 11:27         ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

> > > @@ -933,13 +934,16 @@ keep_dirty:
> > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > >  	}
> > >  
> > > +	/*
> > > +	 * If reclaim is encountering dirty pages, it may be because
> > > +	 * dirty pages are reaching the end of the LRU even though
> > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > +	 * flusher threads to pro-actively clean some pages
> > > +	 */
> > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > 
> > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > number of dirty pages down to 0 whether or not pageout() is called.
> > 
> 
> True, this has been fixed to only wakeup flusher threads when this is
> the file LRU, dirty pages have been encountered and the caller has
> sc->may_writepage.

OK.

> > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > for efficiency.
> > And it seems good to let the flusher write much more
> > than nr_dirty pages to safeguard a reasonable large
> > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > update the comments.
> > 
> 
> Ok, the reasoning had been to flush a number of pages that was related
> to the scanning rate but if that is inefficient for the flusher, I'll
> use MAX_WRITEBACK_PAGES.

It would be better to pass something like (nr_dirty * N).
MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
obviously too large to pass as a parameter here. When the batch size is
increased to 128MB, the writeback code may be improved so that it does
not exceed the nr_pages limit by too much.
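
Back-of-the-envelope numbers for the sizes in question, assuming 4KB
pages and assuming MAX_WRITEBACK_PAGES is the 1024-page define in
fs/fs-writeback.c at the time:

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;                 /* assumed 4KB pages */
        unsigned long max_writeback_pages = 1024;       /* assumed define */
        unsigned long future_batch_bytes = 128UL << 20; /* the 128MB above */

        printf("MAX_WRITEBACK_PAGES ~= %lu MB per batch\n",
               max_writeback_pages * page_size >> 20);  /* 4 MB */
        printf("a 128MB batch ~= %lu pages\n",
               future_batch_bytes / page_size);         /* 32768 pages */
        return 0;
}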

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-26 11:19                     ` Wu Fengguang
@ 2010-07-26 12:53                       ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-26 12:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

On Mon, Jul 26, 2010 at 07:19:53PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > > > ==== CUT HERE ====
> > > > vmscan: Do not writeback filesystem pages in direct reclaim
> > > > 
> > > > When memory is under enough pressure, a process may enter direct
> > > > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > > > encountered during the scan, this page is written to backing storage using
> > > > mapping->writepage. This can result in very deep call stacks, particularly
> > > > if the target storage or filesystem are complex. It has already been observed
> > > > on XFS that the stack overflows but the problem is not XFS-specific.
> > > > 
> > > > This patch prevents direct reclaim writing back filesystem pages by checking
> > > > if current is kswapd or the page is anonymous before writing back.  If the
> > > > dirty pages cannot be written back, they are placed back on the LRU lists
> > > > for either background writing by the BDI threads or kswapd. If in direct
> > > > lumpy reclaim and dirty pages are encountered, the process will stall for
> > > > the background flusher before trying to reclaim the pages again.
> > > > 
> > > > As the call-chain for writing anonymous pages is not expected to be deep
> > > > and they are not cleaned by flusher threads, anonymous pages are still
> > > > written back in direct reclaim.
> > > 
> > > This is also a good step towards reducing pageout() calls. For better
> > > IO performance the flusher threads should take more work from pageout().
> > > 
> > 
> > This is true for better IO performance all right but reclaim does require
> > specific pages cleaned. The strict requirement is when lumpy reclaim is
> > involved but a looser requirement is when any pages within a zone be cleaned.
> 
> Good point, I missed the lumpy reclaim requirement. It seems necessary
> to add a call to the flusher thread to writeback a specific inode range
> (that contains the current dirty page). This is a more reliable way to
> ensure both the strict and looser requirements: the current dirty page
> will guaranteed to be synced, and the inode will have good opportunity 
> to contain more dirty pages in the zone, which can be freed quickly if
> tagged PG_reclaim.
> 

I'm not sure about an inode range. The window being considered is quite small
and we might select ranges that are too small to be useful. However, taking
the inodes into account makes sense. If wakeup_flusher_threads() took a list
of unique inodes that own dirty pages encountered by reclaim, it could then
move those inodes to the head of the queue rather than depending just on
inode expiry.
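
A sketch of the bookkeeping that would imply, again with invented
stand-in types rather than real kernel structures: deduplicate the
inodes owning the dirty pages reclaim has just skipped so the list can
be handed to the flusher.

#include <stddef.h>

struct dirty_page_stub {
        unsigned long ino;      /* stand-in for page->mapping->host */
};

/* Collect the unique inode numbers owning pages[0..nr) into inos[]
 * (capacity assumed >= nr) and return how many were found. A quadratic
 * scan is fine for a sketch; the real list would be short. */
size_t collect_unique_inodes(const struct dirty_page_stub *pages,
                             size_t nr, unsigned long *inos)
{
        size_t found = 0, i, j;

        for (i = 0; i < nr; i++) {
                for (j = 0; j < found; j++)
                        if (inos[j] == pages[i].ino)
                                break;
                if (j == found)
                        inos[found++] = pages[i].ino;
        }
        return found;
}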

> > > > <SNIP>
> > > >
> > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > > > 
> > > >         spin_unlock_irq(&zone->lru_lock);
> > > > 
> > > > -       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > > > +       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > > > +                                                               &nr_dirty);
> > > > 
> > > >         /*
> > > > -        * If we are direct reclaiming for contiguous pages and we do
> > > > +        * If specific pages are needed such as with direct reclaiming
> > > > +        * for contiguous pages or for memory containers and we do
> > > >          * not reclaim everything in the list, try again and wait
> > > > -        * for IO to complete. This will stall high-order allocations
> > > > -        * but that should be acceptable to the caller
> > > > +        * for IO to complete. This will stall callers that require
> > > > +        * specific pages but it should be acceptable to the caller
> > > >          */
> > > > -       if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > > > -                       sc->lumpy_reclaim_mode) {
> > > > -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +       if (sc->may_writepage && !current_is_kswapd() &&
> > > > +                       (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > > > +               int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > > 
> > > > -               /*
> > > > -                * The attempt at page out may have made some
> > > > -                * of the pages active, mark them inactive again.
> > > > -                */
> > > > -               nr_active = clear_active_flags(&page_list, NULL);
> > > > -               count_vm_events(PGDEACTIVATE, nr_active);
> > > > +               while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > > +                       wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > > +                       congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > 
> > > It needs good luck for the flusher threads to "happen to" sync the
> > > dirty pages in our page_list.
> > 
> > That is why I'm expecting the "shrink oldest inode" patchset to help. It
> > still requires a certain amount of luck but callers that encounter dirty
> > pages will be delayed.
> > 
> > It's also because a certain amount of luck is required that the last patch
> > in the series aims at reducing the number of dirty pages encountered by
> > reclaim. The closer that is to 0, the less important the timing of flusher
> > threads is.
> 
> OK.
> 
> > > I'd rather take the logic as "there are
> > > too many dirty pages, shrink them to avoid some future pageout() calls
> > > and/or congestion_wait() stalls".
> > > 
> > 
> > What do you mean by shrink them? They cannot be reclaimed until they are
> > clean.
> 
> I mean we are freeing much more than nr_dirty pages. In this sense we
> are shrinking the number of dirty pages. Note that we are calling
> wakeup_flusher_threads(nr_dirty), however the real synced pages will
> be much more than nr_dirty, which is reasonably good behavior.
> 

Ok.

> > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times.  Let's remove it?
> > > 
> > 
> > This loop only applies to direct reclaimers in lumpy reclaim mode and
> > memory containers. Both need specific pages to be cleaned and freed.
> > Hence, the loop is to stall them and wait on flusher threads up to a
> > point. Otherwise they can cause a reclaim storm of clean pages that
> > can't be used.
> 
> Agreed. We could call the flusher to sync the inode explicitly, as
> recommended above. This will clean and free (with PG_reclaim) the page
> in seconds. With reasonable waits here we may avoid reclaim storm
> effectively.
> 

I'll follow this suggestion as a new patch.

> > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached
> > but I am inferring this from timing data rather than a direct measurement.
> > 
> > > > -               nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > +                       /*
> > > > +                        * The attempt at page out may have made some
> > > > +                        * of the pages active, mark them inactive again.
> > > > +                        */
> > > > +                       nr_active = clear_active_flags(&page_list, NULL);
> > > > +                       count_vm_events(PGDEACTIVATE, nr_active);
> > > > +
> > > > +                       nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > +                                               PAGEOUT_IO_SYNC, &nr_dirty);
> > > 
> > > This shrink_page_list() won't be called at all if nr_dirty==0 and
> > > pageout() was called. This is a change of behavior. It can also be
> > > fixed by removing the loop.
> > > 
> > 
> > The whole patch is a change of behaviour but in this case it also makes
> > sense to focus on just the dirty pages. The first shrink_page_list
> > decided that the pages could not be unmapped and reclaimed - probably
> > because it was referenced. This is not likely to change during the loop.
> 
> Agreed.
> 
> > Testing with a version of the patch that processed the full list added
> > significant stalls when sync writeback was involved. Testing time length
> > was tripled in one case implying that this loop was continually reaching
> > MAX_SWAP_CLEAN_WAIT.
> 
> I'm OK with the change actually, this removes one not-that-user-friendly
> wait_on_page_writeback().
> 
> > The intention of this loop is "wait on dirty pages to be cleaned" and
> > it's a change of behaviour, but one that makes sense and testing
> > indicates it's a good idea.
> 
> I mean, this loop may be unwound. And we may need another loop to
> sync the inodes that contain the dirty pages.
> 

I'm not quite sure what you mean here but I think it might tie into the
idea of passing a list of inodes to the flusher threads. Let's see what
that ends up looking like.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-26 11:27         ` Wu Fengguang
@ 2010-07-26 12:57           ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-26 12:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > +	 * flusher threads to pro-actively clean some pages
> > > > +	 */
> > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > 
> > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > number of dirty pages down to 0 whether or not pageout() is called.
> > > 
> > 
> > True, this has been fixed to only wakeup flusher threads when this is
> > the file LRU, dirty pages have been encountered and the caller has
> > sc->may_writepage.
> 
> OK.
> 
> > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > for efficiency.
> > > And it seems good to let the flusher write much more
> > > than nr_dirty pages to safeguard a reasonable large
> > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > update the comments.
> > > 
> > 
> > Ok, the reasoning had been to flush a number of pages that was related
> > to the scanning rate but if that is inefficient for the flusher, I'll
> > use MAX_WRITEBACK_PAGES.
> 
> It would be better to pass something like (nr_dirty * N).
> MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> obviously too large as a parameter. When the batch size is increased
> to 128MB, the writeback code may be improved somehow to not exceed the
> nr_pages limit too much.
> 

What might be a useful value for N? 1.5 appears to work reasonably well
to create a window of writeback ahead of the scanner but it's a bit
arbitrary.
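
For concreteness, what is being described amounts to something like the
following (rough sketch only, not the final patch; "file" here is assumed
to mean we are scanning the file LRU and nr_dirty counts the dirty pages
seen in this batch):

	/* only kick the flusher for file pages and only if writeback is allowed */
	if (file && nr_dirty && sc->may_writepage)
		wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);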

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-26 12:53                       ` Mel Gorman
@ 2010-07-26 13:03                         ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26 13:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrew Morton, Andrea Arcangeli

On Mon, Jul 26, 2010 at 08:53:26PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 07:19:53PM +0800, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > > > > ==== CUT HERE ====
> > > > > vmscan: Do not writeback filesystem pages in direct reclaim
> > > > > 
> > > > > When memory is under enough pressure, a process may enter direct
> > > > > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > > > > encountered during the scan, this page is written to backing storage using
> > > > > mapping->writepage. This can result in very deep call stacks, particularly
> > > > > if the target storage or filesystem are complex. It has already been observed
> > > > > on XFS that the stack overflows but the problem is not XFS-specific.
> > > > > 
> > > > > This patch prevents direct reclaim writing back filesystem pages by checking
> > > > > if current is kswapd or the page is anonymous before writing back.  If the
> > > > > dirty pages cannot be written back, they are placed back on the LRU lists
> > > > > for either background writing by the BDI threads or kswapd. If in direct
> > > > > lumpy reclaim and dirty pages are encountered, the process will stall for
> > > > > the background flusher before trying to reclaim the pages again.
> > > > > 
> > > > > As the call-chain for writing anonymous pages is not expected to be deep
> > > > > and they are not cleaned by flusher threads, anonymous pages are still
> > > > > written back in direct reclaim.
> > > > 
> > > > This is also a good step towards reducing pageout() calls. For better
> > > > IO performance the flusher threads should take more work from pageout().
> > > > 
> > > 
> > > This is true for better IO performance all right but reclaim does require
> > > specific pages cleaned. The strict requirement is when lumpy reclaim is
> > > involved but a looser requirement is when any pages within a zone be cleaned.
> > 
> > Good point, I missed the lumpy reclaim requirement. It seems necessary
> > to add a call to the flusher thread to writeback a specific inode range
> > (that contains the current dirty page). This is a more reliable way to
> > ensure both the strict and looser requirements: the current dirty page
> > will be guaranteed to be synced, and the inode will have a good opportunity
> > to contain more dirty pages in the zone, which can be freed quickly if
> > tagged PG_reclaim.
> > 
> 
> I'm not sure about an inode range. The window being considered is quite small
> and we might select ranges that are too small to be useful.  However, taking

We don't need to pass the range. We only pass the page offset, and let
the flusher thread select an appropriate range that covers our target page.

This guarantees the target page will be served.

> the inodes into account makes sense. If wakeup_flusher_thread took a list
> of unique inodes that own dirty pages encountered by reclaim, it would then
> move inodes to the head of the queue rather than depending just on expired inodes.

The flusher thread may internally queue all inodes older than this one for IO.
But sure, it'd better serve the target inode first to avoid adding delays.
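
Something like the following is what I mean, as a rough sketch (the
helper name is made up for illustration; the real interface would live
in the writeback code):

	/*
	 * Reclaim only names the inode and the page offset it is stuck on;
	 * the flusher expands that into a reasonably large range and tags
	 * the written pages PG_reclaim so they are freed on IO completion.
	 */
	void flush_inode_page_range(struct inode *inode, pgoff_t index);

	/* flusher side, schematically: align to a 4MB chunk around index */
	pgoff_t chunk = (4 << 20) >> PAGE_SHIFT;	/* 1024 pages at 4KB */
	pgoff_t start = index & ~(chunk - 1);
	pgoff_t end   = start + chunk - 1;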

> 
> > > > > <SNIP>
> > > > >
> > > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > > > > 
> > > > >         spin_unlock_irq(&zone->lru_lock);
> > > > > 
> > > > > -       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > > > > +       nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > > > > +                                                               &nr_dirty);
> > > > > 
> > > > >         /*
> > > > > -        * If we are direct reclaiming for contiguous pages and we do
> > > > > +        * If specific pages are needed such as with direct reclaiming
> > > > > +        * for contiguous pages or for memory containers and we do
> > > > >          * not reclaim everything in the list, try again and wait
> > > > > -        * for IO to complete. This will stall high-order allocations
> > > > > -        * but that should be acceptable to the caller
> > > > > +        * for IO to complete. This will stall callers that require
> > > > > +        * specific pages but it should be acceptable to the caller
> > > > >          */
> > > > > -       if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > > > > -                       sc->lumpy_reclaim_mode) {
> > > > > -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > +       if (sc->may_writepage && !current_is_kswapd() &&
> > > > > +                       (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > > > > +               int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > > > 
> > > > > -               /*
> > > > > -                * The attempt at page out may have made some
> > > > > -                * of the pages active, mark them inactive again.
> > > > > -                */
> > > > > -               nr_active = clear_active_flags(&page_list, NULL);
> > > > > -               count_vm_events(PGDEACTIVATE, nr_active);
> > > > > +               while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > > > +                       wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > > > +                       congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > 
> > > > It needs good luck for the flusher threads to "happen to" sync the
> > > > dirty pages in our page_list.
> > > 
> > > That is why I'm expecting the "shrink oldest inode" patchset to help. It
> > > still requires a certain amount of luck but callers that encounter dirty
> > > pages will be delayed.
> > > 
> > > It's also because a certain amount of luck is required that the last patch
> > > in the series aims at reducing the number of dirty pages encountered by
> > > reclaim. The closer that is to 0, the less important the timing of flusher
> > > threads is.
> > 
> > OK.
> > 
> > > > I'd rather take the logic as "there are
> > > > too many dirty pages, shrink them to avoid some future pageout() calls
> > > > and/or congestion_wait() stalls".
> > > > 
> > > 
> > > What do you mean by shrink them? They cannot be reclaimed until they are
> > > clean.
> > 
> > I mean we are freeing much more than nr_dirty pages. In this sense we
> > are shrinking the number of dirty pages. Note that we are calling
> > wakeup_flusher_threads(nr_dirty), however the real synced pages will
> > be much more than nr_dirty, which is reasonably good behavior.
> > 
> 
> Ok.
> 
> > > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times.  Let's remove it?
> > > > 
> > > 
> > > This loop only applies to direct reclaimers in lumpy reclaim mode and
> > > memory containers. Both need specific pages to be cleaned and freed.
> > > Hence, the loop is to stall them and wait on flusher threads up to a
> > > point. Otherwise they can cause a reclaim storm of clean pages that
> > > can't be used.
> > 
> > Agreed. We could call the flusher to sync the inode explicitly, as
> > recommended above. This will clean and free (with PG_reclaim) the page
> > in seconds. With reasonable waits here we may avoid reclaim storm
> > effectively.
> > 
> 
> I'll follow this suggestion as a new patch.
> 
> > > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached
> > > but I am inferring this from timing data rather than a direct measurement.
> > > 
> > > > > -               nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > > +                       /*
> > > > > +                        * The attempt at page out may have made some
> > > > > +                        * of the pages active, mark them inactive again.
> > > > > +                        */
> > > > > +                       nr_active = clear_active_flags(&page_list, NULL);
> > > > > +                       count_vm_events(PGDEACTIVATE, nr_active);
> > > > > +
> > > > > +                       nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > > +                                               PAGEOUT_IO_SYNC, &nr_dirty);
> > > > 
> > > > This shrink_page_list() won't be called at all if nr_dirty==0 and
> > > > pageout() was called. This is a change of behavior. It can also be
> > > > fixed by removing the loop.
> > > > 
> > > 
> > > The whole patch is a change of behaviour but in this case it also makes
> > > sense to focus on just the dirty pages. The first shrink_page_list
> > > decided that the pages could not be unmapped and reclaimed - probably
> > > because it was referenced. This is not likely to change during the loop.
> > 
> > Agreed.
> > 
> > > Testing with a version of the patch that processed the full list added
> > > significant stalls when sync writeback was involved. Testing time length
> > > was tripled in one case implying that this loop was continually reaching
> > > MAX_SWAP_CLEAN_WAIT.
> > 
> > I'm OK with the change actually, this removes one not-that-user-friendly
> > wait_on_page_writeback().
> > 
> > > The intention of this loop is "wait on dirty pages to be cleaned" and
> > > it's a change of behaviour, but one that makes sense and testing
> > > indicates it's a good idea.
> > 
> > I mean, this loop may be unwound. And we may need another loop to
> > sync the inodes that contain the dirty pages.
> > 
> 
> I'm not quite sure what you mean here but I think it might tie into the
> idea of passing a list of inodes to the flusher threads. Let's see what
> that ends up looking like.

OK.

Thanks,
Fengguang

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-26 12:57           ` Mel Gorman
@ 2010-07-26 13:10             ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26 13:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > +	 */
> > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > 
> > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > 
> > > 
> > > True, this has been fixed to only wakeup flusher threads when this is
> > > the file LRU, dirty pages have been encountered and the caller has
> > > sc->may_writepage.
> > 
> > OK.
> > 
> > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > for efficiency.
> > > > And it seems good to let the flusher write much more
> > > > than nr_dirty pages to safeguard a reasonable large
> > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > update the comments.
> > > > 
> > > 
> > > Ok, the reasoning had been to flush a number of pages that was related
> > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > use MAX_WRITEBACK_PAGES.
> > 
> > It would be better to pass something like (nr_dirty * N).
> > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > obviously too large as a parameter. When the batch size is increased
> > to 128MB, the writeback code may be improved somehow to not exceed the
> > nr_pages limit too much.
> > 
> 
> What might be a useful value for N? 1.5 appears to work reasonably well
> to create a window of writeback ahead of the scanner but it's a bit
> arbitrary.

I'd recommend a large value for N. It's no longer really relevant, though,
since we'll call the flusher to sync some range containing the target page.
The flusher will then choose an N large enough (e.g. 4MB) for efficient
IO. It needs to be a large value, otherwise the vmscan code will
quickly run into dirty pages again.
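
To put rough numbers on that (assuming 4KB pages and the usual
SWAP_CLUSTER_MAX batching in shrink_inactive_list()):

	4MB flusher chunk : (4 << 20) / 4096 = 1024 pages
	nr_dirty per batch: typically no more than a few tens of pages

so a flusher-chosen range stays well ahead of the point the scanner has
reached.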

Thanks,
Fengguang

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26  4:37                           ` Wu Fengguang
  (?)
@ 2010-07-26 16:30                             ` Minchan Kim
  -1 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-26 16:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > >> > Hi
> > >> >
> > >> > sorry for the delay.
> > >> >
> > >> > > Will you be picking it up or should I? The changelog should be more or less
> > >> > > the same as yours and consider it
> > >> > >
> > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >> > >
> > >> > > It'd be nice if the original tester is still knocking around and willing
> > >> > > to confirm the patch resolves his/her problem. I am running this patch on
> > >> > > my desktop at the moment and it does feel a little smoother but it might be
> > >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > >> > > was attributing to the machine being commonly heavily loaded but I haven't
> > >> > > noticed them today.
> > >> > >
> > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > >> > > should use PAGEOUT_IO_SYNC]
> > >> >
> > >> > My reviewing doesn't found any bug. however I think original thread have too many guess
> > >> > and we need to know reproduce way and confirm it.
> > >> >
> > >> > At least, we need three confirms.
> > >> >  o original issue is still there?
> > >> >  o DEF_PRIORITY/3 is best value?
> > >>
> > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
> > >> I guess system has 512M and 22M writeback pages.
> > >> So you may determine it for skipping max 32M writeback pages.
> > >> Is right?
> > >
> > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages.
> > > Because shrink_inactive_list() first calls
> > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally
> > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > > converted to writeback pages and then optionally be waited on.
> > >
> > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks
> > > a reasonable value.
> > 
> > Why do you think it's a reasonable value?
> > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%?
> > I am not against you. Just out of curiosity and requires more explanation.
> > It might be thing _only I_ don't know. :(
> 
> It's more or less randomly selected. I'm also OK with 3.125%. It's a
> threshold to turn on a _last resort_ mechanism, so it doesn't need to be
> optimal.
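
Just to spell out the arithmetic behind the 6.25% (a sketch of the kind of
check being discussed; total_lru_pages is just an illustrative name and the
exact condition in the patch may differ):

	/* DEF_PRIORITY is 12, so DEF_PRIORITY / 3 == 4 */
	unsigned long thresh = total_lru_pages >> (DEF_PRIORITY / 3);

	/* with ~512MB of pages: 512MB >> 4 = 32MB, i.e. 6.25% */
	/* a shift of 3 would give 12.5%, a shift of 5 would give 3.125% */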

Okay. The reason I asked is that I don't want to add a new magic value to
the VM without a detailed comment.
While reviewing the source code, I always suffer from such values. :(
Now we have a great tool called 'git'.
Please write down in detail why we selected that number when we add a new
magic value. :)

Thanks, Wu. 

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
@ 2010-07-26 16:30                             ` Minchan Kim
  0 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-26 16:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > >> > Hi
> > >> >
> > >> > sorry for the delay.
> > >> >
> > >> > > Will you be picking it up or should I? The changelog should be more or less
> > >> > > the same as yours and consider it
> > >> > >
> > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >> > >
> > >> > > It'd be nice if the original tester is still knocking around and willing
> > >> > > to confirm the patch resolves his/her problem. I am running this patch on
> > >> > > my desktop at the moment and it does feel a little smoother but it might be
> > >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > >> > > was attributing to the machine being commonly heavily loaded but I haven't
> > >> > > noticed them today.
> > >> > >
> > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > >> > > should use PAGEOUT_IO_SYNC]
> > >> >
> > >> > My reviewing doesn't found any bug. however I think original thread have too many guess
> > >> > and we need to know reproduce way and confirm it.
> > >> >
> > >> > At least, we need three confirms.
> > >> >  o original issue is still there?
> > >> >  o DEF_PRIORITY/3 is best value?
> > >>
> > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
> > >> I guess system has 512M and 22M writeback pages.
> > >> So you may determine it for skipping max 32M writeback pages.
> > >> Is right?
> > >
> > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages.
> > > Because shrink_inactive_list() first calls
> > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally
> > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > > converted to writeback pages and then optionally be waited on.
> > >
> > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks
> > > a reasonable value.
> > 
> > Why do you think it's a reasonable value?
> > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%?
> > I am not against you. Just out of curiosity and requires more explanation.
> > It might be thing _only I_ don't know. :(
> 
> It's more or less random selected. I'm also OK with 3.125%. It's an
> threshold to turn on some _last resort_ mechanism, so don't need to be
> optimal..

Okay. Why I had a question is that I don't want to add new magic value in 
VM without detailed comment. 
While I review the source code, I always suffer form it. :(
Now we have a great tool called 'git'. 
Please write down why we select that number detaily when we add new 
magic value. :)

Thanks, Wu. 

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
@ 2010-07-26 16:30                             ` Minchan Kim
  0 siblings, 0 replies; 177+ messages in thread
From: Minchan Kim @ 2010-07-26 16:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > >> > Hi
> > >> >
> > >> > sorry for the delay.
> > >> >
> > >> > > Will you be picking it up or should I? The changelog should be more or less
> > >> > > the same as yours and consider it
> > >> > >
> > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >> > >
> > >> > > It'd be nice if the original tester is still knocking around and willing
> > >> > > to confirm the patch resolves his/her problem. I am running this patch on
> > >> > > my desktop at the moment and it does feel a little smoother but it might be
> > >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > >> > > was attributing to the machine being commonly heavily loaded but I haven't
> > >> > > noticed them today.
> > >> > >
> > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > >> > > should use PAGEOUT_IO_SYNC]
> > >> >
> > >> > My reviewing doesn't found any bug. however I think original thread have too many guess
> > >> > and we need to know reproduce way and confirm it.
> > >> >
> > >> > At least, we need three confirms.
> > >> >  o original issue is still there?
> > >> >  o DEF_PRIORITY/3 is best value?
> > >>
> > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU?
> > >> I guess system has 512M and 22M writeback pages.
> > >> So you may determine it for skipping max 32M writeback pages.
> > >> Is right?
> > >
> > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages.
> > > Because shrink_inactive_list() first calls
> > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally
> > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > > converted to writeback pages and then optionally be waited on.
> > >
> > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks
> > > a reasonable value.
> > 
> > Why do you think it's a reasonable value?
> > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%?
> > I am not against you. Just out of curiosity and requires more explanation.
> > It might be thing _only I_ don't know. :(
> 
> It's more or less random selected. I'm also OK with 3.125%. It's an
> threshold to turn on some _last resort_ mechanism, so don't need to be
> optimal..

Okay. The reason I asked is that I don't want to add a new magic value to
the VM without a detailed comment.
While reviewing the source code, I always suffer from them. :(
Now we have a great tool called 'git'.
Please write down in detail why we selected that number whenever we add a new
magic value. :)
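
For illustration only (this is not part of the original mail), the kind of
comment being asked for might read something like the sketch below, assuming
DEF_PRIORITY == 12 and that the threshold is expressed as a fraction of the
LRU size:

	/*
	 * Only turn on the _last resort_ mechanism discussed above once more
	 * than (lru_pages >> DEF_PRIORITY/3) pages are dirty or under
	 * writeback.  With DEF_PRIORITY == 12 that is 1/16 (6.25%) of the
	 * LRU: roughly 32MB on a 512MB machine, well below the ~100MB that a
	 * 20% dirty_ratio would allow to accumulate.
	 */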

Thanks, Wu. 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback
  2010-07-26 16:30                             ` Minchan Kim
@ 2010-07-26 22:48                               ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-26 22:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 12:30:11AM +0800, Minchan Kim wrote:
> On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> > > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> > > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > > >> > Hi
> > > >> >
> > > >> > sorry for the delay.
> > > >> >
> > > >> > > Will you be picking it up or should I? The changelog should be more or less
> > > >> > > the same as yours and consider it
> > > >> > >
> > > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > >> > >
> > > >> > > It'd be nice if the original tester is still knocking around and willing
> > > >> > > to confirm the patch resolves his/her problem. I am running this patch on
> > > >> > > my desktop at the moment and it does feel a little smoother but it might be
> > > >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > > >> > > was attributing to the machine being commonly heavily loaded but I haven't
> > > >> > > noticed them today.
> > > >> > >
> > > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > > >> > > should use PAGEOUT_IO_SYNC]
> > > >> >
> > > >> > My review didn't find any bug. However, I think the original thread has too many
> > > >> > guesses, and we need a way to reproduce the issue and confirm it.
> > > >> >
> > > >> > At least, we need three confirms.
> > > >> >  o original issue is still there?
> > > >> >  o DEF_PRIORITY/3 is best value?
> > > >>
> > > >> I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
> > > >> I guess the system has 512M of memory and 22M of writeback pages,
> > > >> so you may have chosen it to skip at most 32M of writeback pages.
> > > >> Is that right?
> > > >
> > > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages.
> > > > Because shrink_inactive_list() first calls
> > > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally
> > > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > > > converted to writeback pages and then optionally be waited on.
> > > >
> > > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks
> > > > a reasonable value.
> > > 
> > > Why do you think it's a reasonable value?
> > > I mean, why isn't 12.5% or 3.125% good? Why did you select 6.25%?
> > > I am not against you; this is just out of curiosity and needs more explanation.
> > > It might be something _only I_ don't know. :(
> > 
> > It was more or less randomly selected. I'm also OK with 3.125%. It's a
> > threshold to turn on a _last resort_ mechanism, so it doesn't need to be
> > optimal.
> 
> Okay. The reason I asked is that I don't want to add a new magic value to
> the VM without a detailed comment.
> While reviewing the source code, I always suffer from them. :(
> Now we have a great tool called 'git'.
> Please write down in detail why we selected that number whenever we add a new
> magic value. :)

Good point. I'll do that :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-26 13:10             ` Wu Fengguang
@ 2010-07-27 13:35               ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-27 13:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > >  	}
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > +	 */
> > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > 
> > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > 
> > > > 
> > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > the file LRU, dirty pages have been encountered and the caller has
> > > > sc->may_writepage.
> > > 
> > > OK.
> > > 
> > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > for efficiency.
> > > > > And it seems good to let the flusher write much more
> > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > update the comments.
> > > > > 
> > > > 
> > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > use MAX_WRITEBACK_PAGES.
> > > 
> > > It would be better to pass something like (nr_dirty * N).
> > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > obviously too large as a parameter. When the batch size is increased
> > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > nr_pages limit too much.
> > > 
> > 
> > What might be a useful value for N? 1.5 appears to work reasonably well
> > to create a window of writeback ahead of the scanner but it's a bit
> > arbitrary.
> 
> I'd recommend N to be a large value. It's no longer relevant now since
> we'll call the flusher to sync some range containing the target page.
> The flusher will then choose an N large enough (eg. 4MB) for efficient
> IO. It needs to be a large value, otherwise the vmscan code will
> quickly run into dirty pages again..
> 

Ok, I took the 4MB at face value to be a "reasonable amount that should
not cause congestion". The end result is

#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
static inline long nr_writeback_pages(unsigned long nr_dirty)
{
        return laptop_mode ? 0 :
                        min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
}

nr_writeback_pages(nr_dirty) is what gets passed to
wakeup_flusher_threads(). Does that seem sensible?
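
For concreteness (illustrative numbers rather than anything from the thread),
with 4KB pages and SWAP_CLUSTER_MAX == 32 the helper works out roughly as:

	/*
	 * MAX_WRITEBACK    = 4194304 >> 12 = 1024 pages (4MB)
	 * WRITEBACK_FACTOR = 1024 / 32     = 32
	 * nr_dirty =   8  ->  min(1024,  256) =  256 pages (1MB)
	 * nr_dirty =  32  ->  min(1024, 1024) = 1024 pages (4MB)
	 * nr_dirty = 100  ->  min(1024, 3200) = 1024 pages (capped at 4MB)
	 */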


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 13:35               ` Mel Gorman
@ 2010-07-27 14:24                 ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-27 14:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > >  	}
> > > > > > >  
> > > > > > > +	/*
> > > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > > +	 */
> > > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > 
> > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > 
> > > > > 
> > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > sc->may_writepage.
> > > > 
> > > > OK.
> > > > 
> > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > for efficiency.
> > > > > > And it seems good to let the flusher write much more
> > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > update the comments.
> > > > > > 
> > > > > 
> > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > use MAX_WRITEBACK_PAGES.
> > > > 
> > > > It would be better to pass something like (nr_dirty * N).
> > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > obviously too large as a parameter. When the batch size is increased
> > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > nr_pages limit too much.
> > > > 
> > > 
> > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > to create a window of writeback ahead of the scanner but it's a bit
> > > arbitrary.
> > 
> > I'd recommend N to be a large value. It's no longer relevant now since
> > we'll call the flusher to sync some range containing the target page.
> > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > IO. It needs to be a large value, otherwise the vmscan code will
> > quickly run into dirty pages again..
> > 
> 
> Ok, I took the 4MB at face value to be a "reasonable amount that should
> not cause congestion".

Under memory pressure, the disk should be busy/congested anyway.
The big 4MB adds more work, but many of those pages would need to be
synced in the near future anyway. The bigger IO also takes more time,
hence adding some latency, but that latency should be a small factor
compared to the IO queue time (which will be long for a busy disk).

The overall expectation is: the more efficient the IO, the more progress :)

> The end result is
> 
> #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> static inline long nr_writeback_pages(unsigned long nr_dirty)
> {
>         return laptop_mode ? 0 :
>                         min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> }
> 
> nr_writeback_pages(nr_dirty) is what gets passed to
> wakeup_flusher_threads(). Does that seem sensible?

If you plan to keep wakeup_flusher_threads(), a simpler form may be
sufficient, eg.

        laptop_mode ? 0 : (nr_dirty * 16)

On top of this, we may write another patch to convert the
wakeup_flusher_threads(bdi, nr_pages) call to some
bdi_start_inode_writeback(inode, offset) call, to start more targeted
writeback.
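
As a purely illustrative sketch (this interface does not exist yet; the name
and arguments come from the suggestion above and the call site is only a
guess), the vmscan side might eventually look like:

	/* hypothetical: ask the flusher to clean a range around this dirty page */
	bdi_start_inode_writeback(page->mapping->host, page_index(page));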

When talking about the 4MB optimization, I was referring to the internal
implementation of bdi_start_inode_writeback(). Sorry for the missing
context in the previous email.

It may need a big patch to implement bdi_start_inode_writeback().
Would you like to try it, or leave the task to me?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 14:24                 ` Wu Fengguang
@ 2010-07-27 14:34                   ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-27 14:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

> If you plan to keep wakeup_flusher_threads(), a simpler form may be
> sufficient, eg.
> 
>         laptop_mode ? 0 : (nr_dirty * 16)

This number is not sensitive because the writeback code may well round
it up to some more IO-efficient value (currently 4MB). AFAIK the
nr_pages parameters passed by all existing flusher callers are
rule-of-thumb values, far from being exact numbers.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 14:24                 ` Wu Fengguang
@ 2010-07-27 14:38                   ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-27 14:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > > +	/*
> > > > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > > > +	 */
> > > > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > 
> > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > 
> > > > > > 
> > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > sc->may_writepage.
> > > > > 
> > > > > OK.
> > > > > 
> > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > for efficiency.
> > > > > > > And it seems good to let the flusher write much more
> > > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > update the comments.
> > > > > > > 
> > > > > > 
> > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > use MAX_WRITEBACK_PAGES.
> > > > > 
> > > > > It would be better to pass something like (nr_dirty * N).
> > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > obviously too large as a parameter. When the batch size is increased
> > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > nr_pages limit too much.
> > > > > 
> > > > 
> > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > arbitrary.
> > > 
> > > I'd recommend N to be a large value. It's no longer relevant now since
> > > we'll call the flusher to sync some range containing the target page.
> > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > IO. It needs to be a large value, otherwise the vmscan code will
> > > quickly run into dirty pages again..
> > > 
> > 
> > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > not cause congestion".
> 
> Under memory pressure, the disk should be busy/congested anyway.

Not necessarily. The workload could be streaming reads, where pages are being
added to the LRU quickly but the LRU is not necessarily dominated by dirty
pages. Given the scanning rate, a dirty page may be encountered, but only
rarely.

> The big 4MB adds much work, however many of the pages may need to be
> synced in the near future anyway. It also requires more time to do
> the bigger IO, hence adding some latency, however the latency should
> be a small factor comparing to the IO queue time (which will be long
> for a busy disk).
> 
> Overall expectation is, the more efficient IO, the more progress :)
> 

Ok.

> > The end result is
> > 
> > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > {
> >         return laptop_mode ? 0 :
> >                         min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > }
> > 
> > nr_writeback_pages(nr_dirty) is what gets passed to
> > wakeup_flusher_threads(). Does that seem sensible?
> 
> If you plan to keep wakeup_flusher_threads(), a simpler form may be
> sufficient, eg.
> 
>         laptop_mode ? 0 : (nr_dirty * 16)
> 

I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because,
while nr_dirty will usually be < SWAP_CLUSTER_MAX, lumpy reclaim means it
might not be. I wanted to firmly bound how much writeback was being
requested - hence the mild complexity.
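
To put rough numbers on that (my illustration, assuming 4KB pages and the
nr_writeback_pages() helper from earlier in the thread):

	/*
	 * An order-9 lumpy reclaim can pull in up to 512 contiguous pages
	 * around a single target page, so:
	 *   nr_dirty * 16        -> up to 8192 pages (32MB) requested
	 *   nr_writeback_pages() -> min(1024, 512 * 32) = 1024 pages (4MB)
	 */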

> On top of this, we may write another patch to convert the
> wakeup_flusher_threads(bdi, nr_pages) call to some
> bdi_start_inode_writeback(inode, offset) call, to start more oriented
> writeback.
> 

I did a first pass at optimising based on prioritising inodes related to
dirty pages. It's incredibly primitive and I have to sit down and see
how the entirety of writeback is put together to improve on it. Maybe
you'll spot something simple or see that it's totally the wrong direction.
The patch is below.

> When talking the 4MB optimization, I was referring to the internal
> implementation of bdi_start_inode_writeback(). Sorry for the missing
> context in the previous email.
> 

No worries, I was assuming it was something in mainline I didn't know
yet :)

> It may need a big patch to implement bdi_start_inode_writeback().
> Would you like to try it, or leave the task to me?
> 

If you send me a patch, I can try it out but it's not my highest
priority right now. I'm still looking to get writeback-from-reclaim down
to a reasonable level without causing a large amount of churn.

Here is the first pass anyway at kicking wakeup_flusher_threads() for
inodes belonging to a list of pages. You'll note that I do nothing with
the page offset because I didn't spot a simple way of taking that
information into account. It's also horrible from a locking perspective.
So far, its testing has been "it didn't crash".

==== CUT HERE ====
writeback: Prioritise dirty inodes encountered by reclaim for background flushing

It is preferable that as few dirty pages as possible are dispatched for
cleaning from the page reclaim path. When dirty pages are encountered by
page reclaim, this patch marks their inodes so that they are dispatched
immediately. When the background flusher runs, it moves such inodes to the
dispatch queue immediately, regardless of inode age.

This is an early prototype. It could be optimised to avoid taking the
inode lock repeatedly, and ideally the page offset would also be taken
into account.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h        |    5 ++-
 include/linux/writeback.h |    1 +
 mm/vmscan.c               |    6 +++-
 4 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5a3c764..27a8b75 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
-	struct inode *inode;
+	struct inode *inode, *tinode;
 	int do_sb_sort = 0;
 
 	if (wbc->for_kupdate || wbc->for_background) {
@@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 		older_than_this = jiffies - expire_interval;
 	}
 
+	/* Move inodes reclaim found at end of LRU to dispatch queue */
+	list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
+		if (inode->i_state & I_DIRTY_RECLAIM) {
+			inode->i_state &= ~I_DIRTY_RECLAIM;
+			list_move(&inode->i_list, &tmp);
+		}
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
@@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
 	rcu_read_unlock();
 }
 
+/*
+ * Similar to wakeup_flusher_threads except prioritise inodes contained
+ * in the page_list regardless of age
+ */
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
+{
+	struct page *page;
+	struct address_space *mapping;
+	struct inode *inode;
+
+	list_for_each_entry(page, page_list, lru) {
+		if (!PageDirty(page))
+			continue;
+
+		lock_page(page);
+		mapping = page_mapping(page);
+		if (!mapping || mapping == &swapper_space)
+			goto unlock;
+
+		/*
+	 * Test outside the lock to see if it is already set; taking
+	 * the inode lock would be a waste and the inode should be pinned
+	 * by the lock_page
+		 */
+		inode = page->mapping->host;
+		if (inode->i_state & I_DIRTY_RECLAIM)
+			goto unlock;
+
+		/*
+		 * XXX: Yuck, has to be a way of batching this by not requiring
+		 * 	the page lock to pin the inode
+		 */
+		spin_lock(&inode_lock);
+		inode->i_state |= I_DIRTY_RECLAIM;
+		spin_unlock(&inode_lock);
+unlock:
+		unlock_page(page);
+	}
+
+	wakeup_flusher_threads(nr_pages);
+}
+
 static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 {
 	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e29f0ed..8836698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1585,8 +1585,8 @@ struct super_operations {
 /*
  * Inode state bits.  Protected by inode_lock.
  *
- * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
+ * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
  *
  * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
  * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
@@ -1633,6 +1633,7 @@ struct super_operations {
 #define I_DIRTY_SYNC		1
 #define I_DIRTY_DATASYNC	2
 #define I_DIRTY_PAGES		4
+#define I_DIRTY_RECLAIM		256
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
 #define I_WILL_FREE		16
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 494edd6..73a4df2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 		struct writeback_control *wbc);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b66d1f5..bad1abf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,7 +901,8 @@ keep:
 	 * laptop mode avoiding disk spin-ups
 	 */
 	if (file && nr_dirty_seen && sc->may_writepage)
-		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
+		wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
+					page_list);
 
 	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 				list_add(&page->lru, &putback_list);
 			}
 
-			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
+								&page_list);
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*

^ permalink raw reply related	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 14:34                   ` Wu Fengguang
@ 2010-07-27 14:40                     ` Mel Gorman
  -1 siblings, 0 replies; 177+ messages in thread
From: Mel Gorman @ 2010-07-27 14:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> > 
> >         laptop_mode ? 0 : (nr_dirty * 16)
> 
> This number is not sensitive because the writeback code may well round
> it up to some more IO efficient value (currently 4MB). AFAIK the
> nr_pages parameters passed by all existing flusher callers are some
> rule-of-thumb value, and far from being an exact number.
> 

I get that it's a rule of thumb, but I decided I would still pass in some value
related to nr_dirty that was bounded in some manner.  Currently, that bound
is 4MB, but maybe it should have been bounded by MAX_WRITEBACK_PAGES (which is
4MB for x86, but could be anything depending on the base page size).
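
To make the page-size dependence concrete, a rough sketch (purely
illustrative, not from the patch, and assuming MAX_WRITEBACK_PAGES stays
the plain 1024-page count that fs-writeback.c defines today):

/* A fixed page count covers very different byte amounts depending on
 * the base page size. */
static inline unsigned long max_writeback_bytes(unsigned long page_size)
{
	return 1024UL * page_size;	/* assumed MAX_WRITEBACK_PAGES */
}

/*
 * max_writeback_bytes(4096)  ==  4MB  (x86, 4KB pages)
 * max_writeback_bytes(16384) == 16MB  (16KB pages)
 * max_writeback_bytes(65536) == 64MB  (64KB pages)
 */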

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 14:40                     ` Mel Gorman
@ 2010-07-27 14:55                       ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-27 14:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 10:40:26PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > > sufficient, eg.
> > > 
> > >         laptop_mode ? 0 : (nr_dirty * 16)
> > 
> > This number is not sensitive because the writeback code may well round
> > it up to some more IO efficient value (currently 4MB). AFAIK the
> > nr_pages parameters passed by all existing flusher callers are some
> > rule-of-thumb value, and far from being an exact number.
> > 
> 
> I get that it's a rule of thumb, but I decided I would still pass in some value
> related to nr_dirty that was bounded in some manner.
> Currently, that bound is 4MB, but maybe it should have been bounded by
> MAX_WRITEBACK_PAGES (which is 4MB for x86, but could be anything
> depending on the base page size).

I see your worry about a much bigger page size making

        vmscan batch size > writeback batch size

and it's a legitimate worry.
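
As a quick, purely illustrative sketch of where it could bite (assuming
the nr_writeback_pages() style bound, i.e. 4MB expressed in pages, and
SWAP_CLUSTER_MAX == 32; the 256KB base page is hypothetical):

static inline int batch_exceeds_cap(unsigned long page_shift)
{
	unsigned long cap_pages = 4194304UL >> page_shift;	/* 4MB in pages */

	return 32UL > cap_pages;	/* one SWAP_CLUSTER_MAX batch vs the cap */
}

/*
 * page_shift 12 (4KB):   cap = 1024 pages -> 0, plenty of headroom
 * page_shift 16 (64KB):  cap =   64 pages -> 0, only two batches of headroom
 * page_shift 18 (256KB): cap =   16 pages -> 1, a single batch exceeds the cap
 */

Lumpy reclaim only widens the gap, since one pass can isolate well over
SWAP_CLUSTER_MAX pages.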

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 177+ messages in thread

* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-27 14:38                   ` Mel Gorman
@ 2010-07-27 15:21                     ` Wu Fengguang
  -1 siblings, 0 replies; 177+ messages in thread
From: Wu Fengguang @ 2010-07-27 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > >  	}
> > > > > > > > >  
> > > > > > > > > +	/*
> > > > > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > > > > > +	 * the dirty_ratio may be satisfied. In this case, wake
> > > > > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > > > > +	 */
> > > > > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > > 
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > > 
> > > > > > > 
> > > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > > 
> > > > > > OK.
> > > > > > 
> > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > > for efficiency.
> > > > > > > > And it seems good to let the flusher write much more
> > > > > > > > > than nr_dirty pages to safeguard a reasonably large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > > update the comments.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > > 
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > > 
> > > > > 
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > > 
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again..
> > > > 
> > > 
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> > 
> > Under memory pressure, the disk should be busy/congested anyway.
> 
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.

Right.

> > The big 4MB adds much work, however many of the pages may need to be
> > synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency, however the latency should
> > be a small factor comparing to the IO queue time (which will be long
> > for a busy disk).
> > 
> > Overall expectation is, the more efficient IO, the more progress :)
> > 
> 
> Ok.
> 
> > > The end result is
> > > 
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > >         return laptop_mode ? 0 :
> > >                         min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > > 
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> > 
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> > 
> >         laptop_mode ? 0 : (nr_dirty * 16)
> > 
> 
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be, due to lumpy
> reclaim. I wanted to firmly bound how much writeback was being requested -
> hence the mild complexity.

OK.
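
(For reference, a rough illustration of why nr_dirty is not naturally
capped at SWAP_CLUSTER_MAX: under lumpy reclaim each tagged page can
pull in its surrounding order-sized block, so one isolation pass can
see far more than 32 pages. The helper below is an upper bound only.)

static inline unsigned long lumpy_isolation_bound(unsigned int order)
{
	return 32UL << order;	/* SWAP_CLUSTER_MAX * (1 << order) */
}

/*
 * order 0:  32 pages
 * order 3: 256 pages
 * order 9: 16384 pages (2MB huge page allocations on x86)
 */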

> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more oriented
> > writeback.
> > 
> 
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the whole of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.

The simplest style may be

        struct writeback_control wbc = {
                .sync_mode = WB_SYNC_NONE,
                .nr_to_write = MAX_WRITEBACK_PAGES,
        };
       
        mapping->writeback_index = offset;
        return do_writepages(mapping, &wbc);

But sure there will be many details to handle.

> > When talking about the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> > 
> 
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
> 
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> > 
> 
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.

OK. That's already great work.
 
> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with
> page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, its testing has been "it didn't crash".

It seems a neat way to prioritize the inodes with a new flag,
I_DIRTY_RECLAIM. However, it may require a vastly different
implementation once the offset is taken into account. I'll try to work
up a prototype tomorrow.

Thanks,
Fengguang

 
> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
> 
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks their inodes so that they are dispatched
> immediately. When the background flusher runs, it moves such inodes to
> the dispatch queue straight away, regardless of inode age.
> 
> This is an early prototype. It could be optimised to avoid taking the
> inode lock repeatedly, and ideally the page offset would also be
> taken into account.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs.h        |    5 ++-
>  include/linux/writeback.h |    1 +
>  mm/vmscan.c               |    6 +++-
>  4 files changed, 59 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
> -	struct inode *inode;
> +	struct inode *inode, *tinode;
>  	int do_sb_sort = 0;
>  
>  	if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  		older_than_this = jiffies - expire_interval;
>  	}
>  
> +	/* Move inodes reclaim found at end of LRU to dispatch queue */
> +	list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> +		if (inode->i_state & I_DIRTY_RECLAIM) {
> +			inode->i_state &= ~I_DIRTY_RECLAIM;
> +			list_move(&inode->i_list, &tmp);
> +		}
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
>  	rcu_read_unlock();
>  }
>  
> +/*
> + * Similar to wakeup_flusher_threads except that it prioritises inodes
> + * contained in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> +	struct page *page;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +
> +	list_for_each_entry(page, page_list, lru) {
> +		if (!PageDirty(page))
> +			continue;
> +
> +		lock_page(page);
> +		mapping = page_mapping(page);
> +		if (!mapping || mapping == &swapper_space)
> +			goto unlock;
> +
> +		/*
> +		 * Test outside the inode lock to see if the flag is already
> +		 * set; taking the inode lock then would be a waste, and the
> +		 * inode should be pinned by the page lock in the meantime
> +		 */
> +		inode = page->mapping->host;
> +		if (inode->i_state & I_DIRTY_RECLAIM)
> +			goto unlock;
> +
> +		/*
> +		 * XXX: Yuck, there has to be a way of batching this without
> +		 * 	requiring the page lock to pin the inode
> +		 */
> +		spin_lock(&inode_lock);
> +		inode->i_state |= I_DIRTY_RECLAIM;
> +		spin_unlock(&inode_lock);
> +unlock:
> +		unlock_page(page);
> +	}
> +
> +	wakeup_flusher_threads(nr_pages);
> +}
> +
>  static noinline void block_dump___mark_inode_dirty(struct inode *inode)
>  {
>  	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
>  /*
>   * Inode state bits.  Protected by inode_lock.
>   *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
>   *
>   * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
>   * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
>  #define I_DIRTY_SYNC		1
>  #define I_DIRTY_DATASYNC	2
>  #define I_DIRTY_PAGES		4
> +#define I_DIRTY_RECLAIM		256
>  #define __I_NEW			3
>  #define I_NEW			(1 << __I_NEW)
>  #define I_WILL_FREE		16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct writeback_control *wbc);
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
>  	 * laptop mode avoiding disk spin-ups
>  	 */
>  	if (file && nr_dirty_seen && sc->may_writepage)
> -		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> +		wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> +					page_list);
>  
>  	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  				list_add(&page->lru, &putback_list);
>  			}
>  
> -			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> +			wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> +								&page_list);
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  			/*

^ permalink raw reply	[flat|nested] 177+ messages in thread

end of thread, other threads:[~2010-07-27 15:22 UTC | newest]

Thread overview: 177+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman
2010-07-19 13:11 ` Mel Gorman
2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 13:24   ` Rik van Riel
2010-07-19 13:24     ` Rik van Riel
2010-07-19 14:15   ` Christoph Hellwig
2010-07-19 14:15     ` Christoph Hellwig
2010-07-19 14:24     ` Mel Gorman
2010-07-19 14:24       ` Mel Gorman
2010-07-19 14:26       ` Christoph Hellwig
2010-07-19 14:26         ` Christoph Hellwig
2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 13:32   ` Rik van Riel
2010-07-19 13:32     ` Rik van Riel
2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 14:19   ` Christoph Hellwig
2010-07-19 14:19     ` Christoph Hellwig
2010-07-19 14:26     ` Mel Gorman
2010-07-19 14:26       ` Mel Gorman
2010-07-19 18:25   ` Rik van Riel
2010-07-19 18:25     ` Rik van Riel
2010-07-19 22:14   ` Johannes Weiner
2010-07-19 22:14     ` Johannes Weiner
2010-07-20 13:45     ` Mel Gorman
2010-07-20 13:45       ` Mel Gorman
2010-07-20 22:02       ` Johannes Weiner
2010-07-20 22:02         ` Johannes Weiner
2010-07-21 11:36         ` Johannes Weiner
2010-07-21 11:36           ` Johannes Weiner
2010-07-21 11:52         ` Mel Gorman
2010-07-21 11:52           ` Mel Gorman
2010-07-21 12:01           ` KAMEZAWA Hiroyuki
2010-07-21 12:01             ` KAMEZAWA Hiroyuki
2010-07-21 14:27             ` Mel Gorman
2010-07-21 14:27               ` Mel Gorman
2010-07-21 23:57               ` KAMEZAWA Hiroyuki
2010-07-21 23:57                 ` KAMEZAWA Hiroyuki
2010-07-22  9:19                 ` Mel Gorman
2010-07-22  9:19                   ` Mel Gorman
2010-07-22  9:22                   ` KAMEZAWA Hiroyuki
2010-07-22  9:22                     ` KAMEZAWA Hiroyuki
2010-07-21 13:04           ` Johannes Weiner
2010-07-21 13:04             ` Johannes Weiner
2010-07-21 13:38             ` Mel Gorman
2010-07-21 13:38               ` Mel Gorman
2010-07-21 14:28               ` Johannes Weiner
2010-07-21 14:28                 ` Johannes Weiner
2010-07-21 14:31                 ` Mel Gorman
2010-07-21 14:31                   ` Mel Gorman
2010-07-21 14:39                   ` Johannes Weiner
2010-07-21 14:39                     ` Johannes Weiner
2010-07-21 15:06                     ` Mel Gorman
2010-07-21 15:06                       ` Mel Gorman
2010-07-26  8:29               ` Wu Fengguang
2010-07-26  8:29                 ` Wu Fengguang
2010-07-26  9:12                 ` Mel Gorman
2010-07-26  9:12                   ` Mel Gorman
2010-07-26 11:19                   ` Wu Fengguang
2010-07-26 11:19                     ` Wu Fengguang
2010-07-26 12:53                     ` Mel Gorman
2010-07-26 12:53                       ` Mel Gorman
2010-07-26 13:03                       ` Wu Fengguang
2010-07-26 13:03                         ` Wu Fengguang
2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 18:27   ` Rik van Riel
2010-07-19 18:27     ` Rik van Riel
2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 14:20   ` Christoph Hellwig
2010-07-19 14:20     ` Christoph Hellwig
2010-07-19 14:43     ` Mel Gorman
2010-07-19 14:43       ` Mel Gorman
2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 14:21   ` Christoph Hellwig
2010-07-19 14:21     ` Christoph Hellwig
2010-07-19 14:40     ` Mel Gorman
2010-07-19 14:40       ` Mel Gorman
2010-07-19 14:48       ` Christoph Hellwig
2010-07-19 14:48         ` Christoph Hellwig
2010-07-22  8:52       ` Wu Fengguang
2010-07-22  8:52         ` Wu Fengguang
2010-07-22  9:02         ` Wu Fengguang
2010-07-22  9:02           ` Wu Fengguang
2010-07-22  9:21         ` Wu Fengguang
2010-07-22  9:21           ` Wu Fengguang
2010-07-22 10:48           ` Mel Gorman
2010-07-22 10:48             ` Mel Gorman
2010-07-23  9:45             ` Wu Fengguang
2010-07-23  9:45               ` Wu Fengguang
2010-07-23 10:57               ` Mel Gorman
2010-07-23 10:57                 ` Mel Gorman
2010-07-23 11:49                 ` Wu Fengguang
2010-07-23 11:49                   ` Wu Fengguang
2010-07-23 12:20                   ` Wu Fengguang
2010-07-23 12:20                     ` Wu Fengguang
2010-07-25 10:43                 ` KOSAKI Motohiro
2010-07-25 10:43                   ` KOSAKI Motohiro
2010-07-25 12:03                   ` Minchan Kim
2010-07-25 12:03                     ` Minchan Kim
2010-07-26  3:27                     ` Wu Fengguang
2010-07-26  3:27                       ` Wu Fengguang
2010-07-26  4:11                       ` Minchan Kim
2010-07-26  4:11                         ` Minchan Kim
2010-07-26  4:37                         ` Wu Fengguang
2010-07-26  4:37                           ` Wu Fengguang
2010-07-26  4:37                           ` Wu Fengguang
2010-07-26 16:30                           ` Minchan Kim
2010-07-26 16:30                             ` Minchan Kim
2010-07-26 16:30                             ` Minchan Kim
2010-07-26 22:48                             ` Wu Fengguang
2010-07-26 22:48                               ` Wu Fengguang
2010-07-26 22:48                               ` Wu Fengguang
2010-07-26  3:08                   ` Wu Fengguang
2010-07-26  3:08                     ` Wu Fengguang
2010-07-26  3:11                     ` Rik van Riel
2010-07-26  3:11                       ` Rik van Riel
2010-07-26  3:17                       ` Wu Fengguang
2010-07-26  3:17                         ` Wu Fengguang
2010-07-22 15:34           ` Minchan Kim
2010-07-22 15:34             ` Minchan Kim
2010-07-23 11:59             ` Wu Fengguang
2010-07-23 11:59               ` Wu Fengguang
2010-07-22  9:42         ` Mel Gorman
2010-07-22  9:42           ` Mel Gorman
2010-07-23  8:33           ` Wu Fengguang
2010-07-23  8:33             ` Wu Fengguang
2010-07-22  1:13     ` Wu Fengguang
2010-07-22  1:13       ` Wu Fengguang
2010-07-19 18:43   ` Rik van Riel
2010-07-19 18:43     ` Rik van Riel
2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-07-19 13:11   ` Mel Gorman
2010-07-19 14:23   ` Christoph Hellwig
2010-07-19 14:23     ` Christoph Hellwig
2010-07-19 14:37     ` Mel Gorman
2010-07-19 14:37       ` Mel Gorman
2010-07-19 22:48       ` Johannes Weiner
2010-07-19 22:48         ` Johannes Weiner
2010-07-20 14:10         ` Mel Gorman
2010-07-20 14:10           ` Mel Gorman
2010-07-20 22:05           ` Johannes Weiner
2010-07-20 22:05             ` Johannes Weiner
2010-07-19 18:59   ` Rik van Riel
2010-07-19 18:59     ` Rik van Riel
2010-07-19 22:26   ` Johannes Weiner
2010-07-19 22:26     ` Johannes Weiner
2010-07-26  7:28   ` Wu Fengguang
2010-07-26  7:28     ` Wu Fengguang
2010-07-26  9:26     ` Mel Gorman
2010-07-26  9:26       ` Mel Gorman
2010-07-26 11:27       ` Wu Fengguang
2010-07-26 11:27         ` Wu Fengguang
2010-07-26 12:57         ` Mel Gorman
2010-07-26 12:57           ` Mel Gorman
2010-07-26 13:10           ` Wu Fengguang
2010-07-26 13:10             ` Wu Fengguang
2010-07-27 13:35             ` Mel Gorman
2010-07-27 13:35               ` Mel Gorman
2010-07-27 14:24               ` Wu Fengguang
2010-07-27 14:24                 ` Wu Fengguang
2010-07-27 14:34                 ` Wu Fengguang
2010-07-27 14:34                   ` Wu Fengguang
2010-07-27 14:40                   ` Mel Gorman
2010-07-27 14:40                     ` Mel Gorman
2010-07-27 14:55                     ` Wu Fengguang
2010-07-27 14:55                       ` Wu Fengguang
2010-07-27 14:38                 ` Mel Gorman
2010-07-27 14:38                   ` Mel Gorman
2010-07-27 15:21                   ` Wu Fengguang
2010-07-27 15:21                     ` Wu Fengguang
