* [PATCH 0/6] Reduce writeback from page reclaim context V6
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

This is a follow-on series from "Avoid overflowing of stack during page
reclaim". It eliminates writeback requiring a filesystem from direct reclaim
and follows on by reducing the amount of IO required from page reclaim to
mitigate any corner cases from the modification.

Most of this series updates what is already in mmotm.

Changelog since V5
  o Remove the writeback-related patches. They are still undergoing
    changes and while they complement this series, the two series do
    not depend on each other.

Changelog since V4
  o Add patch to prioritise inodes for writeback
  o Drop modifications to XFS and btrfs
  o Correct units in post-processing script
  o Add new patches from Wu related to writeback
  o Only kick flusher threads when dirty file pages are encountered
  o Increase size of writeback window when reclaim encounters dirty pages
  o Remove looping logic from shrink_page_list and instead do it all from
    shrink_inactive_list
  o Rebase to 2.6.35-rc6

Changelog since V3
  o Distinguish between file and anon related IO from page reclaim
  o Allow anon writeback from reclaim context
  o Sync old inodes first in background writeback
  o Pre-emptively clean pages when dirty pages are encountered on the LRU
  o Rebase to 2.6.35-rc5

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead
    wakeup flusher threads to clean the number of pages encountered that were
    dirty
 
Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem with the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series begins by preventing
writeback from direct reclaim. As this is a potentially large change, the
last patch aims to reduce any filesystem writeback from page reclaim and to
depend more on background flush.

The first patch in the series is a roll-up of what is currently in mmotm. It's
provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and
anon page writeback from page reclaim as they use the stack to different
depths. They update the tracepoints and post-processing script appropriately,
noting which mmotm patch they should be merged with.

Patch 4 notes that the units in the report are incorrect and fixes them.

Patch 5 prevents direct reclaim writing out filesystem pages while still
allowing writeback of anon pages, which are in less danger of overflowing
the stack and have no equivalent of background flush to clean them.
For filesystem pages, the flusher threads are asked to clean the number of
pages encountered, the caller waits on congestion and puts the pages back
on the LRU. For lumpy reclaim, the caller waits for a period, calling the
flusher multiple times and waiting for dirty pages to be written out, before
trying to reclaim the dirty pages a second time. This increases the
responsibility of kswapd somewhat because it is now cleaning pages on behalf
of direct reclaimers but, unlike the background flushers, kswapd knows which
zone the pages need to be cleaned from. As the IO is asynchronous, it should
not cause kswapd to stall (at least until the queue is congested) but the
order in which pages are reclaimed from the LRU is altered. Dirty pages that
would have been reclaimed by direct reclaimers get another lap on the LRU.
The dirty pages could have been put on a dedicated list but that increased
counter overhead and the number of lists, and it is unclear whether it is
necessary.
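
For illustration, here is a minimal sketch of the direct reclaim behaviour
described above, using 2.6.35-era interfaces. It is not the actual patch;
the function and variable names (handle_dirty_file_pages, nr_dirty_seen)
are assumptions made purely for the example.

/*
 * Illustrative sketch only, not the patch itself. Direct reclaim skips
 * ->writepage for dirty file pages, asks the flusher threads to clean
 * roughly that many pages and throttles briefly; the dirty pages are
 * then put back on the LRU by the caller and retried later.
 */
static void handle_dirty_file_pages(unsigned long nr_dirty_seen)
{
	if (!nr_dirty_seen || current_is_kswapd())
		return;

	/* Defer the IO to the flusher threads instead of issuing it
	 * from the deep direct reclaim call chain. */
	wakeup_flusher_threads(nr_dirty_seen);

	/* Throttle the direct reclaimer while some of that writeback
	 * makes progress. */
	congestion_wait(BLK_RW_ASYNC, HZ/10);
}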

Patch 6 notes that dirty pages can still be found at the end of the LRU.
If a number of them are encountered, it's reasonable to assume that a similar
number of dirty pages will be discovered in the very near future as that was
the dirtying pattern at the time. The patch pre-emptively kicks the
background flusher to clean a number of pages, creating feedback from page
reclaim to the background flusher that is based on scanning rates.
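
A rough sketch of that feedback is below. The helper name and the window
multiplier are assumptions for illustration; the patch itself decides how
many pages to request based on what reclaim encountered.

/*
 * Illustrative sketch only: scale the background writeback request by the
 * number of dirty pages the LRU scan just encountered so that the cleaning
 * rate follows the scanning rate. The multiplier is a made-up example.
 */
#define RECLAIM_WRITEBACK_WINDOW	2

static void kick_flusher_for_reclaim(unsigned long nr_dirty_encountered)
{
	if (!nr_dirty_encountered)
		return;

	wakeup_flusher_threads(nr_dirty_encountered * RECLAIM_WRITEBACK_WINDOW);
}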

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86:	Intel P4 2-core
X86-64:	AMD Phenom 4-core
PPC64:	PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. Tests on an earlier series indicated that moving to 40 did not make
much difference. The filesystem used for all tests was XFS.

Four kernels are compared.

traceonly-v6		is the first 4 patches of this series
nodirect-v6		is the first 5 patches
flushforward-v6		pre-emptively cleans pages when encountered on the LRU (patch 1-8)
flushprio-v5		flags inodes with dirty pages at end of LRU (patch 1-9)

The results of each test are broken up into two parts. The first part is
a report based on the ftrace post-processing script and covers direct
reclaim and kswapd activity. The second part reports what percentage of
time was spent in direct reclaim, what percentage of time kswapd was awake
and what percentage of the pages scanned were dirty.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and the percentage is the stall time over the total time.
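
For example, taking the X86-64 SysBench figures below for traceonly-v6,
User/Sys is 634.69 seconds and the direct reclaim stall time is 19.75
seconds, so the percentage is 19.75 / (634.69 + 19.75) = 3.02% as reported.
Similarly, "Percentage pages scanned/written" is the sum of the direct
reclaim and kswapd write counts divided by the total pages scanned, and
"Percentage Time kswapd Awake" is the kswapd awake time divided by the
total elapsed time.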

I am omitting the actual performance results simply because they are not
interesting; there were very few significant changes.

kernbench
=========

No writeback from reclaim initiated and no performance change of significance.

IOzone
======

No writeback from reclaim initiated and no performance change of significance.

SysBench
========

The results were based on a read/write workload and, as the machines are
under-provisioned for this type of test, the figures are very unstable,
with variances of up to 15%, so they are not reported. Part of the problem
is that larger thread counts push the test into swap as the memory is
insufficient, which destabilises the results further. I could tune for
this, but it was reclaim that was important.

X86
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                                 17         42          5 
Direct reclaim pages scanned                  3766       4809        361 
Direct reclaim write file async I/O           1658          0          0 
Direct reclaim write anon async I/O              0        315          3 
Direct reclaim write file sync I/O               0          0          0 
Direct reclaim write anon sync I/O               0          0          0 
Wake kswapd requests                        229080     262515     240991 
Kswapd wakeups                                 578        646        567 
Kswapd pages scanned                      12822445   13646919   11443966 
Kswapd reclaim write file async I/O         488806     417628       1676 
Kswapd reclaim write anon async I/O         132832     143463     110880 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)         0.10       1.48       0.00 
Time kswapd awake (seconds)                1035.89    1051.81     846.99 

Total pages scanned                       12826211  13651728  11444327
Percentage pages scanned/written             4.86%     4.11%     0.98%
User/Sys Time Running Test (seconds)       1268.94   1313.47   1251.05
Percentage Time Spent Direct Reclaim         0.01%     0.11%     0.00%
Total Elapsed Time (seconds)               7669.42   8198.84   7583.72
Percentage Time kswapd Awake                13.51%    12.83%    11.17%

Dirty file pages in direct reclaim on the X86 test machine were not much
of a problem to begin with. The patches eliminate them as expected and the
time to complete the test was not negatively impacted as a result.

Pre-emptively writing back a window of dirty pages when they are encountered
on the LRU makes a big difference - the number of dirty file pages
encountered by kswapd was reduced by 99% and the percentage of dirty pages
encountered is reduced to less than 1%, most of which were anon.

X86-64
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                                906        700        897 
Direct reclaim pages scanned                161635     221601      62442 
Direct reclaim write file async I/O          16881          0          0 
Direct reclaim write anon async I/O           2558        562        706 
Direct reclaim write file sync I/O              24          0          0 
Direct reclaim write anon sync I/O               0          0          0 
Wake kswapd requests                        844622     688841     803158 
Kswapd wakeups                                1480       1466       1529 
Kswapd pages scanned                      16194333   16558633   15386430 
Kswapd reclaim write file async I/O         460459     843545     193560 
Kswapd reclaim write anon async I/O         243146     269235     210824 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)        19.75      29.33       5.71 
Time kswapd awake (seconds)                2067.45    2058.20    2108.51 

Total pages scanned                       16355968  16780234  15448872
Percentage pages scanned/written             4.42%     6.63%     2.62%
User/Sys Time Running Test (seconds)        634.69    637.54    659.72
Percentage Time Spent Direct Reclaim         3.02%     4.40%     0.86%
Total Elapsed Time (seconds)               6197.20   6234.80   6591.33
Percentage Time kswapd Awake                33.36%    33.01%    31.99%

Direct reclaim of filesystem pages is eliminated as expected without an
impact on time, although kswapd had to write back more pages as a result.
Again, the full series reduces the percentage of dirty pages encountered
while scanning and, overall, there is less reclaim activity.

PPC64
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                               3378       4151       5658 
Direct reclaim pages scanned                380441     267139     495713 
Direct reclaim write file async I/O          35532          0          0 
Direct reclaim write anon async I/O          18863      17160      30672 
Direct reclaim write file sync I/O               9          0          0 
Direct reclaim write anon sync I/O               0          0          2 
Wake kswapd requests                       1666305    1355794    1949445 
Kswapd wakeups                                 533        509        551 
Kswapd pages scanned                      16206261   15447359   15524846 
Kswapd reclaim write file async I/O        1690129    1749868    1152304 
Kswapd reclaim write anon async I/O         121416     151389     147141 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)        90.84      69.37      74.36 
Time kswapd awake (seconds)                1932.31    1802.39    1999.15 

Total pages scanned                       16586702  15714498  16020559
Percentage pages scanned/written            11.25%    12.21%     8.30%
User/Sys Time Running Test (seconds)       1315.49   1249.23   1314.83
Percentage Time Spent Direct Reclaim         6.46%     5.26%     5.35%
Total Elapsed Time (seconds)               8581.41   7988.79   8719.56
Percentage Time kswapd Awake                22.52%    22.56%    22.93%

Direct reclaim filesystem writes are eliminated of course and the percentage
of dirty pages encountered is reduced.

Stress HighAlloc
================

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated as huge
pages while under load (Pass 1), on a second attempt under load (Pass 2)
and when the kernel compiles have finished and the system is quiet (At
Rest). The patches have little impact on the success rates.

X86
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                                555        496        677 
Direct reclaim pages scanned                187498      83022      91321 
Direct reclaim write file async I/O            684          0          0 
Direct reclaim write anon async I/O          33869       5834       7723 
Direct reclaim write file sync I/O             385          0          0 
Direct reclaim write anon sync I/O           23225        428        191 
Wake kswapd requests                          1613       1484       1805 
Kswapd wakeups                                 517        342        664 
Kswapd pages scanned                      27791653    2570033    3023077 
Kswapd reclaim write file async I/O         308778      19758        345 
Kswapd reclaim write anon async I/O        5232938     109227     167984 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)     18223.83     282.49     392.66 
Time kswapd awake (seconds)               15911.61     307.05     452.35 

Total pages scanned                       27979151   2653055   3114398
Percentage pages scanned/written            20.01%     5.10%     5.66%
User/Sys Time Running Test (seconds)       2806.35   1765.22   1873.86
Percentage Time Spent Direct Reclaim        86.66%    13.80%    17.32%
Total Elapsed Time (seconds)              20382.81   2383.34   2491.23
Percentage Time kswapd Awake                78.06%    12.88%    18.16%

Total time running the test was massively reduced by the series and writebacks
from page reclaim are reduced to almost negligible levels.  The percentage
of dirty pages written is much reduced but obviously remains high as there
isn't an equivalent of background flushers for anon pages.

X86-64
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                               1159       1112       1066 
Direct reclaim pages scanned                172491     147763     142100 
Direct reclaim write file async I/O           2496          0          0 
Direct reclaim write anon async I/O          32486      19527      15355 
Direct reclaim write file sync I/O            1913          0          0 
Direct reclaim write anon sync I/O           14434       2806       3704 
Wake kswapd requests                          1159       1101       1061 
Kswapd wakeups                                1110        827        785 
Kswapd pages scanned                      23467327    8064964    4873397 
Kswapd reclaim write file async I/O         652531      86003       9135 
Kswapd reclaim write anon async I/O        2476541     500556     205612 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)      7906.48    1355.70     428.86 
Time kswapd awake (seconds)                4263.89    1029.43     468.59 

Total pages scanned                       23639818   8212727   5015497
Percentage pages scanned/written            13.45%     7.41%     4.66%
User/Sys Time Running Test (seconds)       2806.01   2744.46   2789.54
Percentage Time Spent Direct Reclaim        73.81%    33.06%    13.33%
Total Elapsed Time (seconds)              10274.33   3705.47   2812.54
Percentage Time kswapd Awake                41.50%    27.78%    16.66%

Again, the test completes far faster with the full series and fewer dirty
pages are encountered. File writebacks from kswapd are reduced to negligible
levels.

PPC64
                    traceonly-v6         nodirect-v6     flushforward-v6
Direct reclaims                                580        529        648 
Direct reclaim pages scanned                111382      92480     106061 
Direct reclaim write file async I/O            673          0          0 
Direct reclaim write anon async I/O          23361      14769      15701 
Direct reclaim write file sync I/O             300          0          0 
Direct reclaim write anon sync I/O           12224      10106       1803 
Wake kswapd requests                           302        276        305 
Kswapd wakeups                                 220        206        140 
Kswapd pages scanned                      10071156    7110936    3622584 
Kswapd reclaim write file async I/O         261563      59626       6818 
Kswapd reclaim write anon async I/O        2230514     689606     422745 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)      5366.14    1668.51     974.11 
Time kswapd awake (seconds)                5094.97    1621.02    1030.18 

Total pages scanned                       10182538   7203416   3728645
Percentage pages scanned/written            24.83%    10.75%    11.99%
User/Sys Time Running Test (seconds)       3398.37   2615.25   2234.56
Percentage Time Spent Direct Reclaim        61.23%    38.95%    30.36%
Total Elapsed Time (seconds)               6990.13   3174.43   2459.29
Percentage Time kswapd Awake                72.89%    51.06%    41.89%

Again, far faster completion times with a significant reduction in the
number of dirty pages encountered.

Overall, the full series eliminates calling into the filesystem from page
reclaim while massively reducing the number of dirty file pages encountered
by page reclaim. There was a concern that no file writeback from page
reclaim would cause problems, and it still might, but preliminary data show
that the number of dirty pages encountered is so small that it is not
likely to be a problem.

There is ongoing work in writeback that should help further reduce the
number of dirty pages encountered, but the two series complement rather
than collide with each other, so there is no merge dependency.

Any objections to merging?

Mel Gorman (6):
  vmscan: tracing: Roll up of patches currently in mmotm
  vmscan: tracing: Update trace event to track if page reclaim IO is
    for anon or file pages
  vmscan: tracing: Update post-processing script to distinguish between
    anon and file IO from page reclaim
  vmscan: tracing: Correct units in post-processing script
  vmscan: Do not writeback filesystem pages in direct reclaim
  vmscan: Kick flusher threads to clean pages when reclaim is
    encountering dirty pages

 .../trace/postprocess/trace-vmscan-postprocess.pl  |  686 ++++++++++++++++++++
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 +
 include/trace/events/kmem.h                        |   38 +-
 include/trace/events/vmscan.h                      |  202 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  481 ++++++++------
 mm/vmstat.c                                        |    2 -
 10 files changed, 1205 insertions(+), 294 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h



* [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

This is a roll-up of patches currently in mmotm related to stack reduction and
tracing reclaim. It is based on 2.6.35-rc6 and included for the convenience
of testing.

No Signed-off-by required.
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  429 +++++++-------
 mm/vmstat.c                                        |    2 -
 10 files changed, 1095 insertions(+), 302 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..d1ddc33
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regex
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate/o\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we will get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f2da66a
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,184 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20a8193..31abd1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bd339e..eefc8b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4089,8 +4089,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b94fe1b..63447ff 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -398,6 +401,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -617,6 +622,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -625,13 +648,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -806,10 +828,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -832,9 +856,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -916,6 +941,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0;
+	unsigned long nr_lumpy_dirty = 0;
+	unsigned long nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -993,12 +1021,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
@@ -1035,7 +1076,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1112,174 +1154,177 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
- * of reclaimed pages
+ * TODO: Try merging with migration's version of putback_lru_pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
 {
-	LIST_HEAD(page_list);
+	struct page *page;
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	pagevec_init(&pvec, 1);
 
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
 	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
 
-	pagevec_init(&pvec, 1);
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
 
-		if (nr_taken == 0)
-			goto done;
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+/*
+ * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * of reclaimed pages
+ */
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	while (unlikely(too_many_isolated(zone, file, sc))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+		/* We are about to die and free our memory. Return now. */
+		if (fatal_signal_pending(current))
+			return SWAP_CLUSTER_MAX;
+	}
 
-		spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	lru_add_drain();
+	spin_lock_irq(&zone->lru_lock);
 
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
-		nr_reclaimed += nr_freed;
+	spin_unlock_irq(&zone->lru_lock);
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
+			sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		spin_lock(&zone->lru_lock);
 		/*
-		 * Put back any unfreeable pages.
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
-		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_active = clear_active_flags(&page_list, NULL);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-  	} while (nr_scanned < max_scan);
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
-	return nr_reclaimed;
-}
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-/*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+	return nr_reclaimed;
 }
 
 /*
@@ -1729,13 +1774,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1745,17 +1789,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1787,10 +1822,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	bool all_unreclaimable;
 	unsigned long total_scanned = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1798,18 +1831,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
-	/*
-	 * mem_cgroup will not do shrink_slab.
-	 */
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-		}
-	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -1821,6 +1842,15 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * over limit cgroups
 		 */
 		if (scanning_global_lru(sc)) {
+			unsigned long lru_pages = 0;
+			for_each_zone_zonelist(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask)) {
+				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+					continue;
+
+				lru_pages += zone_reclaimable_pages(zone);
+			}
+
 			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
 			if (reclaim_state) {
 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1861,17 +1891,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -1888,6 +1907,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1900,7 +1920,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2028,22 +2056,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2111,9 +2129,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2186,16 +2202,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2299,9 +2305,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2321,8 +2328,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2342,6 +2351,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
@@ -2611,7 +2621,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm
@ 2010-07-30 13:36   ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

This is a roll-up of patches currently in mmotm related to stack reduction and
tracing reclaim. It is based on 2.6.35-rc6 and included for the convenience
of testing.
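
The report is generated from the vmscan tracepoints added below. As a minimal
sketch for testers (assuming debugfs is mounted at /sys/kernel/debug and the
commands are run from the kernel source tree so the script path matches the
diffstat), the events can be enabled and the output piped straight into the
post-processing script:

  echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable
  cat /sys/kernel/debug/tracing/trace_pipe | \
	perl Documentation/trace/postprocess/trace-vmscan-postprocess.pl --ignore-pid

--ignore-pid aggregates processes of the same name into a single entry; drop
it to keep a per-PID breakdown.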

No sign-off required.
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  429 +++++++-------
 mm/vmstat.c                                        |    2 -
 10 files changed, 1095 insertions(+), 302 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..d1ddc33
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the length of the longest process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we will get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f2da66a
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,184 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20a8193..31abd1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bd339e..eefc8b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4089,8 +4089,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b94fe1b..63447ff 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -398,6 +401,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -617,6 +622,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -625,13 +648,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -806,10 +828,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there a need to periodically free_page_list? It would
+		 * appear not, as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -832,9 +856,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -916,6 +941,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0;
+	unsigned long nr_lumpy_dirty = 0;
+	unsigned long nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -993,12 +1021,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
@@ -1035,7 +1076,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1112,174 +1154,177 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
- * of reclaimed pages
+ * TODO: Try merging with migrations version of putback_lru_pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
 {
-	LIST_HEAD(page_list);
+	struct page *page;
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	pagevec_init(&pvec, 1);
 
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
 	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
 
-	pagevec_init(&pvec, 1);
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
-	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
 
-		if (nr_taken == 0)
-			goto done;
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+/*
+ * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * of reclaimed pages
+ */
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	while (unlikely(too_many_isolated(zone, file, sc))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+		/* We are about to die and free our memory. Return now. */
+		if (fatal_signal_pending(current))
+			return SWAP_CLUSTER_MAX;
+	}
 
-		spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	lru_add_drain();
+	spin_lock_irq(&zone->lru_lock);
 
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ?
+				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
-		nr_reclaimed += nr_freed;
+	spin_unlock_irq(&zone->lru_lock);
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
+			sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		spin_lock(&zone->lru_lock);
 		/*
-		 * Put back any unfreeable pages.
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
-		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_active = clear_active_flags(&page_list, NULL);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-  	} while (nr_scanned < max_scan);
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
-	return nr_reclaimed;
-}
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-/*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+	return nr_reclaimed;
 }
 
 /*
@@ -1729,13 +1774,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1745,17 +1789,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1787,10 +1822,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	bool all_unreclaimable;
 	unsigned long total_scanned = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1798,18 +1831,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 	if (scanning_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
-	/*
-	 * mem_cgroup will not do shrink_slab.
-	 */
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-		}
-	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
@@ -1821,6 +1842,15 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * over limit cgroups
 		 */
 		if (scanning_global_lru(sc)) {
+			unsigned long lru_pages = 0;
+			for_each_zone_zonelist(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask)) {
+				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+					continue;
+
+				lru_pages += zone_reclaimable_pages(zone);
+			}
+
 			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
 			if (reclaim_state) {
 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1861,17 +1891,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -1888,6 +1907,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1900,7 +1920,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2028,22 +2056,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2111,9 +2129,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2186,16 +2202,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2299,9 +2305,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2321,8 +2328,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2342,6 +2351,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
@@ -2611,7 +2621,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1
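
For illustration only, not part of the patch: once the events added above are
enabled under /sys/kernel/debug/tracing/events/vmscan/, mm_vmscan_writepage
emits lines in the "page=%p pfn=%lu sync_io=%d" format, which is what the
post-processing script's default regex expects. The standalone Perl sketch
below checks that against an invented sample line; the page pointer and pfn
values are made up.

#!/usr/bin/perl
# Standalone sketch, not part of the patch. The sample line below is
# invented; the regex is the script's default for mm_vmscan_writepage.
use strict;
use warnings;

my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
my $sample_details = 'page=ffffea00010e73c0 pfn=1107904 sync_io=0';

if ($sample_details =~ /$regex_writepage_default/) {
	print "page $1 pfn $2 written, ", $3 ? "sync" : "async", " IO\n";
} else {
	print "WARNING: sample did not match the default regex\n";
}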


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/6] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
  2010-07-30 13:36 ` Mel Gorman
@ 2010-07-30 13:36   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

It is useful to distinguish between IO for anon and file pages. This
patch updates
vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include
that information. The patches can be merged together.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |   30 ++++++++++++++++++++++++------
 mm/vmscan.c                   |    2 +-
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f2da66a..69789dc 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -8,6 +8,24 @@
 #include <linux/tracepoint.h>
 #include "gfpflags.h"
 
+#define RECLAIM_WB_ANON		0x0001u
+#define RECLAIM_WB_FILE		0x0002u
+#define RECLAIM_WB_SYNC		0x0004u
+#define RECLAIM_WB_ASYNC	0x0008u
+
+#define show_reclaim_flags(flags)				\
+	(flags) ? __print_flags(flags, "|",			\
+		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
+		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
+		{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
+		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
+		) : "RECLAIM_WB_NONE"
+
+#define trace_reclaim_flags(page, sync) ( \
+	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
+	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	)
+
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 	TP_PROTO(int nid),
@@ -158,24 +176,24 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 TRACE_EVENT(mm_vmscan_writepage,
 
 	TP_PROTO(struct page *page,
-		int sync_io),
+		int reclaim_flags),
 
-	TP_ARGS(page, sync_io),
+	TP_ARGS(page, reclaim_flags),
 
 	TP_STRUCT__entry(
 		__field(struct page *, page)
-		__field(int, sync_io)
+		__field(int, reclaim_flags)
 	),
 
 	TP_fast_assign(
 		__entry->page = page;
-		__entry->sync_io = sync_io;
+		__entry->reclaim_flags = reclaim_flags;
 	),
 
-	TP_printk("page=%p pfn=%lu sync_io=%d",
+	TP_printk("page=%p pfn=%lu flags=%s",
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->sync_io)
+		show_reclaim_flags(__entry->reclaim_flags))
 );
 
 #endif /* _TRACE_VMSCAN_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63447ff..d83812a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -402,7 +402,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
-			sync_writeback == PAGEOUT_IO_SYNC);
+			trace_reclaim_flags(page, sync_writeback));
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
-- 
1.7.1
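
For illustration only, not part of the patch: the RECLAIM_WB_* values are
single bits, so trace_reclaim_flags() always yields one file-or-anon bit OR'd
with one sync-or-async bit, and show_reclaim_flags() renders that as a
'|'-separated string such as RECLAIM_WB_FILE|RECLAIM_WB_ASYNC. The standalone
Perl sketch below is a userspace model of those C macros, not the macros
themselves.

#!/usr/bin/perl
# Standalone sketch, not part of the patch: a userspace model of
# trace_reclaim_flags() and show_reclaim_flags(). The constants mirror
# the RECLAIM_WB_* defines added above.
use strict;
use warnings;

use constant {
	RECLAIM_WB_ANON  => 0x0001,
	RECLAIM_WB_FILE  => 0x0002,
	RECLAIM_WB_SYNC  => 0x0004,
	RECLAIM_WB_ASYNC => 0x0008,
};

# file-or-anon bit OR'd with sync-or-async bit, as in the C macro
sub reclaim_flags {
	my ($is_file, $is_sync) = @_;
	return ($is_file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) |
	       ($is_sync ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC);
}

# __print_flags()-style decode: matching names joined with "|" in define order
sub show_reclaim_flags {
	my ($flags) = @_;
	my @names;
	push @names, "RECLAIM_WB_ANON"  if $flags & RECLAIM_WB_ANON;
	push @names, "RECLAIM_WB_FILE"  if $flags & RECLAIM_WB_FILE;
	push @names, "RECLAIM_WB_SYNC"  if $flags & RECLAIM_WB_SYNC;
	push @names, "RECLAIM_WB_ASYNC" if $flags & RECLAIM_WB_ASYNC;
	return @names ? join("|", @names) : "RECLAIM_WB_NONE";
}

print show_reclaim_flags(reclaim_flags(1, 0)), "\n";	# RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
print show_reclaim_flags(reclaim_flags(0, 1)), "\n";	# RECLAIM_WB_ANON|RECLAIM_WB_SYNC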


^ permalink raw reply related	[flat|nested] 58+ messages in thread


* [PATCH 3/6] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim
  2010-07-30 13:36 ` Mel Gorman
@ 2010-07-30 13:36   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

It is useful to distinguish between IO for anon and file pages. This patch
updates
vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch
so the post-processing script can handle the additional information.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |   96 +++++++++++++-------
 1 files changed, 64 insertions(+), 32 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index d1ddc33..f87f56e 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -21,9 +21,12 @@ use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
 use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
 use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
 use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
-use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
-use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
-use constant EVENT_UNKNOWN			=> 10;
+use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC	=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC	=> 9;
+use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC	=> 10;
+use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC	=> 11;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 12;
+use constant EVENT_UNKNOWN			=> 13;
 
 # Per-order events
 use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
@@ -55,9 +58,11 @@ my $opt_read_procstat;
 my $total_wakeup_kswapd;
 my ($total_direct_reclaim, $total_direct_nr_scanned);
 my ($total_direct_latency, $total_kswapd_latency);
-my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
+my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
 my ($total_kswapd_nr_scanned, $total_kswapd_wake);
-my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
+my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
 
 # Catch sigint and exit on request
 my $sigint_report = 0;
@@ -101,7 +106,7 @@ my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
 my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
 my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
 my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
-my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
 
 # Dyanically discovered regex
 my $regex_direct_begin;
@@ -209,7 +214,7 @@ $regex_lru_shrink_active = generate_traceevent_regex(
 $regex_writepage = generate_traceevent_regex(
 			"vmscan/mm_vmscan_writepage",
 			$regex_writepage_default,
-			"page", "pfn", "sync_io");
+			"page", "pfn", "flags");
 
 sub read_statline($) {
 	my $pid = $_[0];
@@ -379,11 +384,27 @@ EVENT_PROCESS:
 				next;
 			}
 
-			my $sync_io = $3;
+			my $flags = $3;
+			my $file = 0;
+			my $sync_io = 0;
+			if ($flags =~ /RECLAIM_WB_FILE/) {
+				$file = 1;
+			}
+			if ($flags =~ /RECLAIM_WB_SYNC/) {
+				$sync_io = 1;
+			}
 			if ($sync_io) {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
+				}
 			} else {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
+				}
 			}
 		} else {
 			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
@@ -427,7 +448,7 @@ sub dump_stats {
 		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
 			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
 
-			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
 				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
 				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
 				$total_direct_latency += $latency;
@@ -454,8 +475,11 @@ sub dump_stats {
 		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
 		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+
+		$total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		my $index = 0;
 		my $this_reclaim_delay = 0;
@@ -470,8 +494,8 @@ sub dump_stats {
 			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
 			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
 			$this_reclaim_delay / 1000);
 
 		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -515,16 +539,18 @@ sub dump_stats {
 
 		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
 		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
 			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
 
 		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
 			print "      ";
@@ -551,18 +577,22 @@ sub dump_stats {
 	$total_direct_latency /= 1000;
 	$total_kswapd_latency /= 1000;
 	print "\nSummary\n";
-	print "Direct reclaims:     		$total_direct_reclaim\n";
-	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
-	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
-	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
-	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
-	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "Direct reclaims:     			$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:		$total_direct_nr_scanned\n";
+	print "Direct reclaim write file sync I/O:	$total_direct_writepage_file_sync\n";
+	print "Direct reclaim write anon sync I/O:	$total_direct_writepage_anon_sync\n";
+	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
+	print "Direct reclaim write anon async I/O:	$total_direct_writepage_anon_async\n";
+	print "Wake kswapd requests:			$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 		%-1.2f ms\n", $total_direct_latency;
 	print "\n";
-	print "Kswapd wakeups:			$total_kswapd_wake\n";
-	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
-	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
-	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
-	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+	print "Kswapd wakeups:				$total_kswapd_wake\n";
+	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write file sync I/O:	$total_kswapd_writepage_file_sync\n";
+	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
+	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
+	print "Kswapd reclaim write anon async I/O:	$total_kswapd_writepage_anon_async\n";
+	printf "Time kswapd awake:			%-1.2f ms\n", $total_kswapd_latency;
 }
 
 sub aggregate_perprocesspid() {
@@ -582,8 +612,10 @@ sub aggregate_perprocesspid() {
 		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
 		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		for (my $order = 0; $order < 20; $order++) {
 			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 3/6] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim
@ 2010-07-30 13:36   ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

It is useful to distinguish between IO for anon and file pages. This patch
updates
vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch
so the post-processing script can handle the additional information.
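
As a reference for what the updated script now does with the flags field, here
is a minimal standalone sketch of the classification; the sample trace line
(including the RECLAIM_WB_ASYNC name) is invented for illustration rather than
copied from a real trace:

  #!/usr/bin/perl
  # Classify one mm_vmscan_writepage line into the FILE/ANON x SYNC/ASYNC
  # buckets the post-processing script counts per process.
  use strict;
  use warnings;

  my $line = "page=ffffea0000123456 pfn=4660 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC";
  my ($flags) = $line =~ /flags=([A-Z_|]*)/;

  my $file = ($flags =~ /RECLAIM_WB_FILE/) ? "FILE" : "ANON";
  my $sync = ($flags =~ /RECLAIM_WB_SYNC/) ? "SYNC" : "ASYNC";

  # The real script increments the per-process counter of this name,
  # e.g. MM_VMSCAN_WRITEPAGE_FILE_ASYNC for the sample line above.
  print "MM_VMSCAN_WRITEPAGE_${file}_${sync}\n";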

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |   96 +++++++++++++-------
 1 files changed, 64 insertions(+), 32 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index d1ddc33..f87f56e 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -21,9 +21,12 @@ use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
 use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
 use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
 use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
-use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
-use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
-use constant EVENT_UNKNOWN			=> 10;
+use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC	=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC	=> 9;
+use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC	=> 10;
+use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC	=> 11;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 12;
+use constant EVENT_UNKNOWN			=> 13;
 
 # Per-order events
 use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
@@ -55,9 +58,11 @@ my $opt_read_procstat;
 my $total_wakeup_kswapd;
 my ($total_direct_reclaim, $total_direct_nr_scanned);
 my ($total_direct_latency, $total_kswapd_latency);
-my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
+my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
 my ($total_kswapd_nr_scanned, $total_kswapd_wake);
-my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
+my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
 
 # Catch sigint and exit on request
 my $sigint_report = 0;
@@ -101,7 +106,7 @@ my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
 my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
 my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
 my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
-my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
 
 # Dynamically discovered regex
 my $regex_direct_begin;
@@ -209,7 +214,7 @@ $regex_lru_shrink_active = generate_traceevent_regex(
 $regex_writepage = generate_traceevent_regex(
 			"vmscan/mm_vmscan_writepage",
 			$regex_writepage_default,
-			"page", "pfn", "sync_io");
+			"page", "pfn", "flags");
 
 sub read_statline($) {
 	my $pid = $_[0];
@@ -379,11 +384,27 @@ EVENT_PROCESS:
 				next;
 			}
 
-			my $sync_io = $3;
+			my $flags = $3;
+			my $file = 0;
+			my $sync_io = 0;
+			if ($flags =~ /RECLAIM_WB_FILE/) {
+				$file = 1;
+			}
+			if ($flags =~ /RECLAIM_WB_SYNC/) {
+				$sync_io = 1;
+			}
 			if ($sync_io) {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
+				}
 			} else {
-				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+				if ($file) {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
+				} else {
+					$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
+				}
 			}
 		} else {
 			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
@@ -427,7 +448,7 @@ sub dump_stats {
 		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
 			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
 
-			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
 				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
 				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
 				$total_direct_latency += $latency;
@@ -454,8 +475,11 @@ sub dump_stats {
 		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
 		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+
+		$total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		my $index = 0;
 		my $this_reclaim_delay = 0;
@@ -470,8 +494,8 @@ sub dump_stats {
 			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
 			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
 			$this_reclaim_delay / 1000);
 
 		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -515,16 +539,18 @@ sub dump_stats {
 
 		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
 		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
-		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
 			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
-			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
 
 		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
 			print "      ";
@@ -551,18 +577,22 @@ sub dump_stats {
 	$total_direct_latency /= 1000;
 	$total_kswapd_latency /= 1000;
 	print "\nSummary\n";
-	print "Direct reclaims:     		$total_direct_reclaim\n";
-	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
-	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
-	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
-	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
-	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "Direct reclaims:     			$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:		$total_direct_nr_scanned\n";
+	print "Direct reclaim write file sync I/O:	$total_direct_writepage_file_sync\n";
+	print "Direct reclaim write anon sync I/O:	$total_direct_writepage_anon_sync\n";
+	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
+	print "Direct reclaim write anon async I/O:	$total_direct_writepage_anon_async\n";
+	print "Wake kswapd requests:			$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 		%-1.2f ms\n", $total_direct_latency;
 	print "\n";
-	print "Kswapd wakeups:			$total_kswapd_wake\n";
-	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
-	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
-	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
-	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+	print "Kswapd wakeups:				$total_kswapd_wake\n";
+	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write file sync I/O:	$total_kswapd_writepage_file_sync\n";
+	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
+	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
+	print "Kswapd reclaim write anon async I/O:	$total_kswapd_writepage_anon_async\n";
+	printf "Time kswapd awake:			%-1.2f ms\n", $total_kswapd_latency;
 }
 
 sub aggregate_perprocesspid() {
@@ -582,8 +612,10 @@ sub aggregate_perprocesspid() {
 		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
 		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
-		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
 		for (my $order = 0; $order < 20; $order++) {
 			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 4/6] vmscan: tracing: Correct units in post-processing script
  2010-07-30 13:36 ` Mel Gorman
@ 2010-07-30 13:36   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

The post-processing script is reporting the wrong units. Correct it.  This
patch updates vmscan-tracing-add-trace-event-when-a-page-is-written.patch
to include that information. The patches can be merged together.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index f87f56e..f1b70a8 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -584,7 +584,7 @@ sub dump_stats {
 	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
 	print "Direct reclaim write anon async I/O:	$total_direct_writepage_anon_async\n";
 	print "Wake kswapd requests:			$total_wakeup_kswapd\n";
-	printf "Time stalled direct reclaim: 		%-1.2f ms\n", $total_direct_latency;
+	printf "Time stalled direct reclaim: 		%-1.2f seconds\n", $total_direct_latency;
 	print "\n";
 	print "Kswapd wakeups:				$total_kswapd_wake\n";
 	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
@@ -592,7 +592,7 @@ sub dump_stats {
 	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
 	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
 	print "Kswapd reclaim write anon async I/O:	$total_kswapd_writepage_anon_async\n";
-	printf "Time kswapd awake:			%-1.2f ms\n", $total_kswapd_latency;
+	printf "Time kswapd awake:			%-1.2f seconds\n", $total_kswapd_latency;
 }
 
 sub aggregate_perprocesspid() {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 5/6] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-30 13:36 ` Mel Gorman
@ 2010-07-30 13:36   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.

This patch prevents direct reclaim from writing back filesystem pages by
checking whether current is kswapd or the page is anonymous before writing
back.  If the dirty pages cannot be written back, they are placed back on
the LRU lists for either background writing by the BDI threads or kswapd.
If dirty pages are encountered during direct lumpy reclaim, the process
stalls waiting for the background flusher before trying to reclaim the
pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.
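
A back-of-envelope check of the "up to five seconds" figure in the patch
below, assuming every congestion_wait(BLK_RW_ASYNC, HZ/10) sleeps for its
full timeout (the constants are taken from the patch; the calculation is
only illustrative):

  # 50 retries x 100ms congestion_wait per retry ~= 5 seconds worst case.
  use strict;
  use warnings;

  my $max_swap_clean_wait = 50;    # MAX_SWAP_CLEAN_WAIT
  my $wait_per_retry_ms   = 100;   # HZ/10, i.e. a tenth of a second

  printf "worst-case direct lumpy reclaim stall: %.1f seconds\n",
         $max_swap_clean_wait * $wait_per_retry_ms / 1000;   # prints 5.0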

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d83812a..2d2b588 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)	(1)
 #endif
 
+/* Direct lumpy reclaim waits up to five seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -645,11 +648,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+					enum pageout_io sync_writeback,
+					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -743,6 +748,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				nr_dirty++;
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -860,6 +874,8 @@ keep:
 	free_page_list(&free_pages);
 
 	list_splice(&ret_pages, page_list);
+
+	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1242,12 +1258,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
+	LIST_HEAD(putback_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
 	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1296,28 +1314,49 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
+								&nr_dirty);
 
 	/*
-	 * If we are direct reclaiming for contiguous pages and we do
+	 * If specific pages are needed such as with direct reclaiming
+	 * for contiguous pages or for memory containers and we do
 	 * not reclaim everything in the list, try again and wait
-	 * for IO to complete. This will stall high-order allocations
-	 * but that should be acceptable to the caller
+	 * for IO to complete. This will stall callers that require
+	 * specific pages but it should be acceptable to the caller
 	 */
-	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	if (sc->may_writepage && !current_is_kswapd() &&
+			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			struct page *page, *tmp;
+
+			/* Take off the clean pages marked for activation */
+			list_for_each_entry_safe(page, tmp, &page_list, lru) {
+				if (PageDirty(page) || PageWriteback(page))
+					continue;
+
+				list_del(&page->lru);
+				list_add(&page->lru, &putback_list);
+			}
+
+			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+			/*
+			 * The attempt at page out may have made some
+			 * of the pages active, mark them inactive again.
+			 */
+			nr_active = clear_active_flags(&page_list, NULL);
+			count_vm_events(PGDEACTIVATE, nr_active);
+
+			nr_reclaimed += shrink_page_list(&page_list, sc,
+						PAGEOUT_IO_SYNC, &nr_dirty);
+		}
 	}
 
+	list_splice(&putback_list, &page_list);
+
 	local_irq_disable();
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 13:36 ` Mel Gorman
@ 2010-07-30 13:37   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 13:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli, Mel Gorman

There are a number of cases where pages get cleaned, but two of concern
to this patch are:
  o When dirtying pages, processes may be throttled to clean pages if
    dirty_ratio is not met.
  o Pages belonging to inodes dirtied longer than
    dirty_writeback_centisecs get cleaned.

The problem for reclaim is that dirty pages can reach the end of the LRU if
pages are being dirtied slowly, so that neither the throttling nor a
periodically waking flusher thread cleans them.

Background flush is already cleaning old or expired inodes first, but the
expire time is too far in the future at the time of page reclaim. To mitigate
future problems, this patch wakes flusher threads to clean up to 4M of data -
an amount that should be manageable without causing congestion in many cases.

Ideally, the background flushers would only be cleaning pages belonging
to the zone being scanned but it's not clear if this would be of benefit
(less IO) or not (potentially less efficient IO if an inode is scattered
across multiple zones).
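
To make the 4M figure concrete, here is a throwaway sketch of the arithmetic
behind nr_writeback_pages() in the patch below. It assumes 4K pages and
SWAP_CLUSTER_MAX == 32 and ignores the laptop_mode case, where no writeback
is requested:

  use strict;
  use warnings;
  use List::Util qw(min);

  my $max_writeback    = 4194304 >> 12;        # MAX_WRITEBACK: 1024 pages == 4M
  my $writeback_factor = $max_writeback / 32;  # WRITEBACK_FACTOR: 32

  foreach my $nr_dirty (1, 4, 32, 128) {
      my $pages = min($max_writeback, $nr_dirty * $writeback_factor);
      printf "nr_dirty=%3d -> wake flusher threads for %4d pages (%4dK)\n",
             $nr_dirty, $pages, $pages * 4;
  }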

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d2b588..c4c81bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -142,6 +142,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
 /* Direct lumpy reclaim waits up to five seconds for background cleaning */
 #define MAX_SWAP_CLEAN_WAIT 50
 
+/*
+ * When reclaim encounters dirty data, wakeup flusher threads to clean
+ * a maximum of 4M of data.
+ */
+#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
+#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
+static inline long nr_writeback_pages(unsigned long nr_dirty)
+{
+	return laptop_mode ? 0 :
+			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
+}
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -649,12 +661,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback,
+					int file,
 					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
+	unsigned long nr_dirty_seen = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -748,6 +762,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty_seen++;
+
 			/*
 			 * Only kswapd can writeback filesystem pages to
 			 * avoid risk of stack overflow
@@ -875,6 +891,18 @@ keep:
 
 	list_splice(&ret_pages, page_list);
 
+	/*
+	 * If reclaim is encountering dirty pages, it may be because
+	 * dirty pages are reaching the end of the LRU even though the
+	 * dirty_ratio may be satisfied. In this case, wake flusher
+	 * threads to pro-actively clean up to a maximum of
+	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
+	 * !may_writepage indicates that this is a direct reclaimer in
+	 * laptop mode avoiding disk spin-ups
+	 */
+	if (file && nr_dirty_seen && sc->may_writepage)
+		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
+
 	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
@@ -1315,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	spin_unlock_irq(&zone->lru_lock);
 
 	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
-								&nr_dirty);
+							file, &nr_dirty);
 
 	/*
 	 * If specific pages are needed such as with direct reclaiming
@@ -1351,7 +1379,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			count_vm_events(PGDEACTIVATE, nr_active);
 
 			nr_reclaimed += shrink_page_list(&page_list, sc,
-						PAGEOUT_IO_SYNC, &nr_dirty);
+						PAGEOUT_IO_SYNC, file,
+						&nr_dirty);
 		}
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm
  2010-07-30 13:36   ` Mel Gorman
@ 2010-07-30 14:04     ` Frederic Weisbecker
  -1 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2010-07-30 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Fri, Jul 30, 2010 at 02:36:55PM +0100, Mel Gorman wrote:
> This is a roll-up of patches currently in mmotm related to stack reduction and
> tracing reclaim. It is based on 2.6.35-rc6 and included for the convenience
> of testing.
> 
> No signed off required.
> ---
>  .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++



I have the feeling you've made an ad-hoc post processing script that seems
to rewrite all the format parsing, debugfs, stream handling, etc... we
have that in perf tools already.

Maybe you weren't aware of what we have in perf in terms of scripting support.

First, launch perf list and spot the events you're interested in, let's
say you're interested in irqs:

$ perf list
  [...]
  irq:irq_handler_entry                      [Tracepoint event]
  irq:irq_handler_exit                       [Tracepoint event]
  irq:softirq_entry                          [Tracepoint event]
  irq:softirq_exit                           [Tracepoint event]
  [...]

Now do a trace record:

# perf record -e irq:irq_handler_entry -e irq:irq_handler_exit -e irq:softirq_entry -e irq:softirq_exit cmd

or more simply:

# perf record -e irq:* cmd

You can use -a instead of cmd for wide tracing.

Now generate a perf parsing script on top of these traces:

# perf trace -g perl
generated Perl script: perf-trace.pl


Fill in the trace handlers inside perf-trace.pl and just run it:

# perf trace -s perf-trace.pl

Once ready, you can place your script in the script directory.
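
For instance, a filled-in handler that just counts softirq entries per task
might look roughly like the sketch below. The argument list mirrors the stubs
perf generates, and the $vec field name comes from the softirq_entry format,
so double-check both against your generated script:

  my %softirqs;

  sub irq::softirq_entry
  {
      my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
          $common_pid, $common_comm,
          $vec) = @_;

      # Count softirq entries per process/thread name.
      $softirqs{$common_comm}++;
  }

  sub trace_end
  {
      printf "%-20s %8d\n", $_, $softirqs{$_} for sort keys %softirqs;
  }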


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm
  2010-07-30 14:04     ` Frederic Weisbecker
@ 2010-07-30 14:12       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-30 14:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
> On Fri, Jul 30, 2010 at 02:36:55PM +0100, Mel Gorman wrote:
> > This is a roll-up of patches currently in mmotm related to stack reduction and
> > tracing reclaim. It is based on 2.6.35-rc6 and included for the convenience
> > of testing.
> > 
> > No signed off required.
> > ---
> >  .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
> 
> I have the feeling you've made an ad-hoc post processing script that seems
> to rewrite all the format parsing, debugfs, stream handling, etc... we
> have that in perf tools already.
> 

It's an ad-hoc adaptation of trace-pagealloc-postprocess.pl, which was
developed before the perf scripting support existed. It's a bit clunky.

> May be you weren't aware of what we have in perf in terms of scripting support.
> 

I'm aware, I just haven't gotten around to adapting what the script does
to the perf scripting support. The existence of the script I have means
people can reproduce my results without having to wait for me to rewrite
the post-processing scripts for perf.

> First, launch perf list and spot the events you're interested in, let's
> say you're interested in irqs:
> 
> $ perf list
>   [...]
>   irq:irq_handler_entry                      [Tracepoint event]
>   irq:irq_handler_exit                       [Tracepoint event]
>   irq:softirq_entry                          [Tracepoint event]
>   irq:softirq_exit                           [Tracepoint event]
>   [...]
> 
> Now do a trace record:
> 
> # perf record -e irq:irq_handler_entry -e irq:irq_handler_exit -e irq:softirq_entry -e irq:softirq_exit cmd
> 
> or more simple:
> 
> # perf record -e irq:* cmd
> 
> You can use -a instead of cmd for wide tracing.
> 
> Now generate a perf parsing script on top of these traces:
> 
> # perf trace -g perl
> generated Perl script: perf-trace.pl
> 
> Fill up the trace handlers inside perf-trace.pl and just run it:
> 
> # perf trace -s perf-trace.pl
> 
> Once ready, you can place your script in the script directory.
> 

Ultimately, the post-processing scripts should be adapted to perf but it
could be a while before I get around to it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm
  2010-07-30 14:12       ` Mel Gorman
@ 2010-07-30 14:15         ` Frederic Weisbecker
  -1 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2010-07-30 14:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Fri, Jul 30, 2010 at 03:12:18PM +0100, Mel Gorman wrote:
> On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
> > On Fri, Jul 30, 2010 at 02:36:55PM +0100, Mel Gorman wrote:
> > > This is a roll-up of patches currently in mmotm related to stack reduction and
> > > tracing reclaim. It is based on 2.6.35-rc6 and included for the convenience
> > > of testing.
> > > 
> > > No signed off required.
> > > ---
> > >  .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
> > 
> > I have the feeling you've made an ad-hoc post processing script that seems
> > to rewrite all the format parsing, debugfs, stream handling, etc... we
> > have that in perf tools already.
> > 
> 
> It's an hoc adaption of trace-pagealloc-postprocess.pl which was developed
> before the perf scripting report. It's a bit klunky.
> 
> > May be you weren't aware of what we have in perf in terms of scripting support.
> > 
> 
> I'm aware, I just haven't gotten around to adapting what the script does
> to the perf scripting support. The existance of the script I have means
> people can reproduce my results without having to wait for me to rewrite
> the post-processing scripts for perf.
> 
> > First, launch perf list and spot the events you're interested in, let's
> > say you're interested in irqs:
> > 
> > $ perf list
> >   [...]
> >   irq:irq_handler_entry                      [Tracepoint event]
> >   irq:irq_handler_exit                       [Tracepoint event]
> >   irq:softirq_entry                          [Tracepoint event]
> >   irq:softirq_exit                           [Tracepoint event]
> >   [...]
> > 
> > Now do a trace record:
> > 
> > # perf record -e irq:irq_handler_entry -e irq:irq_handler_exit -e irq:softirq_entry -e irq:softirq_exit cmd
> > 
> > or more simple:
> > 
> > # perf record -e irq:* cmd
> > 
> > You can use -a instead of cmd for wide tracing.
> > 
> > Now generate a perf parsing script on top of these traces:
> > 
> > # perf trace -g perl
> > generated Perl script: perf-trace.pl
> > 
> > Fill up the trace handlers inside perf-trace.pl and just run it:
> > 
> > # perf trace -s perf-trace.pl
> > 
> > Once ready, you can place your script in the script directory.
> > 
> 
> Ultimately, the post-processing scripts should be adapted to perf but it
> could be a while before I get around to it.


Ok, I thought it was a brand new thing. No problem then.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 13:37   ` Mel Gorman
@ 2010-07-30 22:06     ` Andrew Morton
  -1 siblings, 0 replies; 58+ messages in thread
From: Andrew Morton @ 2010-07-30 22:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli

On Fri, 30 Jul 2010 14:37:00 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
>   o When dirtying pages, processes may be throttled to clean pages if
>     dirty_ratio is not met.

Ambiguous.  I assume you meant "if dirty_ratio is exceeded".

>   o Pages belonging to inodes dirtied longer than
>     dirty_writeback_centisecs get cleaned.
> 
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly so that neither the throttling or a flusher
> thread waking periodically cleans them.
> 
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
> 
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
> 

Sigh.  We have sooo many problems with writeback and latency.  Read
https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.  Everyone's
running away from the issue and here we are adding code to solve some
alleged stack-overflow problem which seems to be largely a non-problem,
by making changes which may worsen our real problems.

direct-reclaim wants to write a dirty page because that page is in the
zone which the caller wants to allocate from!  Telling the flusher
threads to perform generic writeback will sometimes cause them to just
gum the disk up with pages from different zones, making it even
harder/slower to allocate a page from the zones we're interested in,
no?

If/when that happens, the problem will be rare, subtle, will take a
long time to get reported and will take years to understand and fix and
will probably be reported in the monster bug report which everyone's
hiding from anyway.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 22:06     ` Andrew Morton
  (?)
@ 2010-07-30 22:40       ` Trond Myklebust
  -1 siblings, 0 replies; 58+ messages in thread
From: Trond Myklebust @ 2010-07-30 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli

On Fri, 2010-07-30 at 15:06 -0700, Andrew Morton wrote:
> On Fri, 30 Jul 2010 14:37:00 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> >   o When dirtying pages, processes may be throttled to clean pages if
> >     dirty_ratio is not met.
> 
> Ambiguous.  I assume you meant "if dirty_ratio is exceeded".
> 
> >   o Pages belonging to inodes dirtied longer than
> >     dirty_writeback_centisecs get cleaned.
> > 
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling or a flusher
> > thread waking periodically cleans them.
> > 
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> > 
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> > 
> 
> Sigh.  We have sooo many problems with writeback and latency.  Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.  Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.
> 
> direct-reclaim wants to write a dirty page because that page is in the
> zone which the caller wants to allocate from!  Telling the flusher
> threads to perform generic writeback will sometimes cause them to just
> gum the disk up with pages from different zones, making it even
> harder/slower to allocate a page from the zones we're interested in,
> no?
> 
> If/when that happens, the problem will be rare, subtle, will take a
> long time to get reported and will take years to understand and fix and
> will probably be reported in the monster bug report which everyone's
> hiding from anyway.

There is that, and then there are issues with the VM simply lying to the
filesystems.

See https://bugzilla.kernel.org/show_bug.cgi?id=16056

Which basically boils down to the following: kswapd tells the filesystem
that it is quite safe to do GFP_KERNEL allocations in pageouts and as
part of try_to_release_page().

In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
and 'for_reclaim' flags in the writeback_control struct, and so the
filesystem has at least some hint that it should do non-blocking i/o.
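
For reference, this is roughly the writeback_control that pageout() in
mm/vmscan.c sets up in kernels of this era (reconstructed from memory, so
the details may differ slightly):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,		/* do not wait for writeback to finish */
		.nr_to_write	= SWAP_CLUSTER_MAX,	/* only a small batch per call */
		.range_start	= 0,
		.range_end	= LLONG_MAX,
		.nonblocking	= 1,			/* hint: do not block on the request queue */
		.for_reclaim	= 1,			/* tell the fs this is reclaim I/O */
	};

	/* ... */
	res = mapping->a_ops->writepage(page, &wbc);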

However if you trust the GFP_KERNEL flag in try_to_release_page() then
the kernel can and will deadlock, and so I had to add in a hack
specifically to tell the NFS client not to trust that flag if it comes
from kswapd.
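
To make the shape of that hack concrete, here is a minimal sketch (illustrative
only, assuming a PF_KSWAPD test and the nfs_wb_page() helper; it is not the
exact code that was merged):

	static int nfs_release_page_sketch(struct page *page, gfp_t gfp)
	{
		/* Only start and wait on I/O if the caller can really sleep
		 * and is not kswapd, whatever the gfp mask claims. */
		if ((gfp & GFP_KERNEL) == GFP_KERNEL &&
		    !(current->flags & PF_KSWAPD))
			nfs_wb_page(page->mapping->host, page);

		/* Pages still carrying private (unstable) data are not freeable */
		return !PagePrivate(page);
	}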

 Trond


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 22:06     ` Andrew Morton
@ 2010-07-31 10:33       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-07-31 10:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli

On Fri, Jul 30, 2010 at 03:06:01PM -0700, Andrew Morton wrote:
> On Fri, 30 Jul 2010 14:37:00 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> >   o When dirtying pages, processes may be throttled to clean pages if
> >     dirty_ratio is not met.
> 
> Ambiguous.  I assume you meant "if dirty_ratio is exceeded".
> 

Yes.

> >   o Pages belonging to inodes dirtied longer than
> >     dirty_writeback_centisecs get cleaned.
> > 
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling or a flusher
> > thread waking periodically cleans them.
> > 
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> > 
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> > 
> 
> Sigh.  We have sooo many problems with writeback and latency.  Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.

You aren't joking.

> Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.
> 

As it is, filesystems are beginning to ignore writeback from direct
reclaim - such as xfs and btrfs. I'm led to believe that ext3
effectively ignores writeback from direct reclaim, although I don't have
access to the code at the moment to double check (am on the road). So either
way, we are going to be facing this problem, so the VM might as well be
aware of it :/

> direct-reclaim wants to write a dirty page because that page is in the
> zone which the caller wants to allocate from!  Telling the flusher
> threads to perform generic writeback will sometimes cause them to just
> gum the disk up with pages from different zones, making it even
> harder/slower to allocate a page from the zones we're interested in,
> no?
> 

It's a possibility, but it can happen anyway if the filesystem is ignoring
writeback requests from direct reclaim. I considered passing in the zone to
flusher threads to clean nr_pages from a given zone but then worried about
getting caught by the "poor IO pattern" people and what happened if two
zones needed cleaning with a single inode's pages in both.
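
For what it's worth, the simple non-zone-aware version is just a sketch along
these lines (assuming the wakeup_flusher_threads() interface of this kernel;
the helper name and constant are illustrative, not the patch as posted):

	/* Ask the flusher threads for at least ~4MB of writeback when
	 * reclaim keeps running into dirty pages. */
	static void kick_flushers_for_reclaim(unsigned long nr_dirty_seen)
	{
		unsigned long nr_to_clean = (4UL << 20) >> PAGE_SHIFT;

		if (nr_dirty_seen)
			wakeup_flusher_threads(max(nr_to_clean, nr_dirty_seen));
	}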

> If/when that happens, the problem will be rare, subtle, will take a
> long time to get reported and will take years to understand and fix and
> will probably be reported in the monster bug report which everyone's
> hiding from anyway.
> 

With the second patch reducing the number of dirty pages encountered by page
reclaim, I'm hoping there will be some impact on latency. I'll be back online
properly on Tuesday and will try reproducing some of the problems in that bug
to see if I can spot an underlying cause of some sort.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 22:40       ` Trond Myklebust
@ 2010-08-01  8:19         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2010-08-01  8:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: kosaki.motohiro, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

Hi Trond,

> There is that, and then there are issues with the VM simply lying to the
> filesystems.
> 
> See https://bugzilla.kernel.org/show_bug.cgi?id=16056
> 
> Which basically boils down to the following: kswapd tells the filesystem
> that it is quite safe to do GFP_KERNEL allocations in pageouts and as
> part of try_to_release_page().
> 
> In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
> and 'for_reclaim' flags in the writeback_control struct, and so the
> filesystem has at least some hint that it should do non-blocking i/o.
> 
> However if you trust the GFP_KERNEL flag in try_to_release_page() then
> the kernel can and will deadlock, and so I had to add in a hack
> specifically to tell the NFS client not to trust that flag if it comes
> from kswapd.

Can you please elaborate on your issue a bit more? The vmscan logic is, briefly, as below:

	if (PageDirty(page))
		pageout(page);
	if (page_has_private(page))
		try_to_release_page(page, sc->gfp_mask);

So, I'm interested in why NFS needs to write back again in ->release_page even
though pageout() calls ->writepage and that was successful.

In other words, the gfp_mask argument of try_to_release_page() is expected
to be passed on to the kmalloc()/alloc_page() family, and the page allocator
already takes care of the PF_MEMALLOC flag.
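
(For reference, the watermark override in gfp_to_alloc_flags() looks roughly
like this; reconstructed from memory of mm/page_alloc.c around 2.6.35, so the
details may differ:)

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		/* reclaimers and OOM victims may dip below the watermarks */
		if (!in_interrupt() &&
		    ((current->flags & PF_MEMALLOC) ||
		     unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}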

So, my question is: what additional work do you want from the VM folks?
Can you please share the NFS design and what we should do?


BTW, another question: recently, Xiaotian Feng posted the "swap over nfs -v21"
patch series. It has a new memory reservation framework. Does this help you?





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 22:06     ` Andrew Morton
@ 2010-08-01 11:15       ` Wu Fengguang
  -1 siblings, 0 replies; 58+ messages in thread
From: Wu Fengguang @ 2010-08-01 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli

> Sigh.  We have sooo many problems with writeback and latency.  Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.  Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.

This looks like some vmscan/writeback interaction issue.

Firstly, the CFQ io scheduler can already prevent read IO from being
delayed by lots of ASYNC write IO. See the commits 365722bb/8e2967555
in late 2009.

Reading a big file in an idle system:
        680897928 bytes (681 MB) copied, 15.8986 s, 42.8 MB/s

Reading a big file while doing sequential writes to another file:
        680897928 bytes (681 MB) copied, 27.6007 s, 24.7 MB/s
        680897928 bytes (681 MB) copied, 25.6592 s, 26.5 MB/s

So CFQ offers reasonable read performance under heavy writeback.

Secondly, I can only feel the responsiveness lags when there is
memory pressure _in addition to_ heavy writeback.

        cp /dev/zero /tmp

No lags.

        usemem 1g --sleep 1000

Still no lags.

        usemem 1g --sleep 1000

Still no lags.

        usemem 1g --sleep 1000

I begin to feel lags at times. My desktop has 4G of memory and no swap
space, so the lags are correlated with page reclaim pressure.

The above symptoms are matched very well by the patches posted by
KOSAKI and me:

- vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
- vmscan: synchronous lumpy reclaim don't call congestion_wait()

However kernels as early as 2.6.18 are reported to have the problem,
so there may be more hidden issues.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 22:06     ` Andrew Morton
@ 2010-08-01 11:56       ` Wu Fengguang
  -1 siblings, 0 replies; 58+ messages in thread
From: Wu Fengguang @ 2010-08-01 11:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Andrea Arcangeli

> Sigh.  We have sooo many problems with writeback and latency.  Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.  Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.

I'm sweeping through bug 12309. Most people report some data writes, though
relatively few explicitly stated that memory pressure is another necessary
condition.

One interesting report is #3. Thomas reported the same slowdown
_without_ any IO. He was able to narrow down the bug to somewhere
between 2.6.20.21 and 2.6.22.19. I searched through the git history and found
a congestion_wait() in commit 232ea4d69d (throttle_vm_writeout():
don't loop on GFP_NOFS and GFP_NOIO allocations) which was later
removed by commit 369f2389e7 (writeback: remove unnecessary wait in
throttle_vm_writeout()).

How can the congestion_wait(HZ/10) be a problem? Because it
unconditionally enters the wait loop. So if no IO is underway, it
effectively becomes a schedule_timeout(HZ/10) because there are
no IO completion events to wake it up.
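
For reference, the check that 232ea4d69d added was roughly the following
(reconstructed from memory, so the details may differ):

	/* In throttle_vm_writeout(): GFP_NOFS/GFP_NOIO callers must not
	 * loop here, so they do a single congestion_wait() and return.
	 * With no writeback in flight nothing ever signals the wait, so
	 * this is effectively a fixed ~100ms sleep per call. */
	if ((gfp_mask & (__GFP_FS | __GFP_IO)) != (__GFP_FS | __GFP_IO)) {
		congestion_wait(WRITE, HZ/10);
		return;
	}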

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-08-01 11:56       ` Wu Fengguang
  (?)
@ 2010-08-01 13:03         ` Wu Fengguang
  -1 siblings, 0 replies; 58+ messages in thread
From: Wu Fengguang @ 2010-08-01 13:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Jens Axboe, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, pvz, bgamari, larppaxyz,
	seanj, kernel-bugs.dev1world, akatopaz, frankrq2009, thomas.pi,
	spawels13, vshader, rockorequin, ylalym, theholyettlz, hassium

On Sun, Aug 01, 2010 at 07:56:40PM +0800, Wu Fengguang wrote:
> > Sigh.  We have sooo many problems with writeback and latency.  Read
> > https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.  Everyone's
> > running away from the issue and here we are adding code to solve some
> > alleged stack-overflow problem which seems to be largely a non-problem,
> > by making changes which may worsen our real problems.
> 
> I'm sweeping bug 12309. Most people reports some data writes, though
> relative few explicitly stated memory pressure is another necessary
> condition.

#14: Per von Zweigbergk
Ubuntu 2.6.27 slowdown when copying 25MB/s USB stick to 10 MB/s SSD.

KOSAKI's and my patches won't fix 2.6.27, since it only does
congestion_wait() and wait_on_page_writeback() for order>3
allocations. There may be more bugs there.

#24: Per von Zweigbergk
The encryption of the SSD very significantly increases the problem.

This is expected. Data encryption roughly doubles page consumption
speed (there may be temp buffers allocated/dropped quickly), hence
vmscan pressure.

#26: Per von Zweigbergk
Disabling swap makes the terminal launch much faster while copying;
However Firefox and vim hang much more aggressively and frequently
during copying.

It's interesting to see processes behave differently. Is this
reproducible at all?

#34: Ben Gamari
There is evidence that x86-64 is a factor here.

Because x86-64 does order-1 page allocation in fork() and consumes
more memory (larger user space code/data)?

#36: Lari Temmes
Go from usable to totally unusable when switching from
a SMP kernel to a UP kernel on a single CPU laptop

He should be testing 2.6.28. I'm not aware of known bugs there.

#47: xyke
Renicing pdflush -10 had some great improvement on basic
responsiveness.

It sure helps :)

Too many (old) messages there. I'm hoping some of the still-active
bug reporters will test the following patches (they are for the -mmotm
tree; the code needs to be unindented for Linus's tree) and see if there are
any improvements.

http://lkml.org/lkml/2010/8/1/40
http://lkml.org/lkml/2010/8/1/45

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-08-01  8:19         ` KOSAKI Motohiro
  (?)
@ 2010-08-01 16:21           ` Trond Myklebust
  -1 siblings, 0 replies; 58+ messages in thread
From: Trond Myklebust @ 2010-08-01 16:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

On Sun, 2010-08-01 at 17:19 +0900, KOSAKI Motohiro wrote:
> Hi Trond,
> 
> > There is that, and then there are issues with the VM simply lying to the
> > filesystems.
> > 
> > See https://bugzilla.kernel.org/show_bug.cgi?id=16056
> > 
> > Which basically boils down to the following: kswapd tells the filesystem
> > that it is quite safe to do GFP_KERNEL allocations in pageouts and as
> > part of try_to_release_page().
> > 
> > In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
> > and 'for_reclaim' flags in the writeback_control struct, and so the
> > filesystem has at least some hint that it should do non-blocking i/o.
> > 
> > However if you trust the GFP_KERNEL flag in try_to_release_page() then
> > the kernel can and will deadlock, and so I had to add in a hack
> > specifically to tell the NFS client not to trust that flag if it comes
> > from kswapd.
> 
> Can you please elaborate on your issue a bit more? The vmscan logic is, briefly, as below:
> 
> 	if (PageDirty(page))
> 		pageout(page);
> 	if (page_has_private(page))
> 		try_to_release_page(page, sc->gfp_mask);
> 
> So, I'm interested in why NFS needs to write back again in ->release_page even
> though pageout() calls ->writepage and that was successful.
> 
> In other words, the gfp_mask argument of try_to_release_page() is expected
> to be passed on to the kmalloc()/alloc_page() family, and the page allocator
> already takes care of the PF_MEMALLOC flag.
> 
> So, my question is: what additional work do you want from the VM folks?
> Can you please share the NFS design and what we should do?
> 
> 
> BTW, another question: recently, Xiaotian Feng posted the "swap over nfs -v21"
> patch series. It has a new memory reservation framework. Does this help you?

The problem that I am seeing is that try_to_release_page() needs to
be told to act as a non-blocking call when the process is kswapd, just
like the pageout() call.

Currently, the sc->gfp_mask is set to GFP_KERNEL, which normally means
that the call may wait on I/O to complete. However, what I'm seeing in
the bugzilla above is that if kswapd waits on an RPC call, then the
whole VM may gum up: typically, the traces show that the socket layer
cannot allocate memory to hold the RPC reply from the server, and so it
is kicking kswapd to have it reclaim some pages; however, kswapd is stuck
in try_to_release_page() waiting for that same I/O to complete, hence
the deadlock...

IOW: I think kswapd at least should be calling try_to_release_page()
with a gfp-flag of '0' to avoid deadlocking on I/O.
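
Something like the following in shrink_page_list() is the shape of what I have
in mind (a sketch only; the label and surrounding code are assumed from the
vmscan.c of this era, not a tested patch):

		if (page_has_private(page)) {
			/* kswapd must never wait on I/O here, so strip the
			 * blocking gfp bits for it */
			gfp_t gfp = (current->flags & PF_KSWAPD) ? 0 : sc->gfp_mask;

			if (!try_to_release_page(page, gfp))
				goto activate_locked;
		}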

Cheers
  Trond


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
       [not found]         ` <80868B70-B17D-4007-AA15-5C11F0F95353@xyke.com>
  2010-08-02  2:30             ` Wu Fengguang
@ 2010-08-02  2:30             ` Wu Fengguang
  0 siblings, 0 replies; 58+ messages in thread
From: Wu Fengguang @ 2010-08-02  2:30 UTC (permalink / raw)
  To: Sean Jensen-Grey
  Cc: Andrew Morton, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Jens Axboe, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, pvz,
	bgamari, larppaxyz, seanj, kernel-bugs.dev1world, akatopaz,
	frankrq2009, thomas.pi, spawels13, vshader, rockorequin, ylalym,
	theholyettlz, hassium

Hi Sean,

On Mon, Aug 02, 2010 at 10:17:27AM +0800, Sean Jensen-Grey wrote:
> Wu,
> 
> Thank you for doing this. This still bites me on a weekly basis. I don't have much time to test the patches this week, but I should get access to an identical box week after next.

That's OK.

> BTW, I experience the issues even with 8-10GB of free ram. I have 12GB currently.

Thanks for the important information. It means the patches proposed
are not likely to help your case.

In Comment #47 for bug 12309, your kernel 2.6.27 is too old, though. You may
well benefit from Jens' CFQ low-latency improvements if you switch to a recent
kernel.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-08-01 16:21           ` Trond Myklebust
@ 2010-08-02  7:57             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2010-08-02  7:57 UTC (permalink / raw)
  To: Trond Myklebust, Chris Mason
  Cc: kosaki.motohiro, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

Hi

> The problem that I am seeing is that the try_to_release_page() needs to
> be told to act as a non-blocking call when the process is kswapd, just
> like the pageout() call.
> 
> Currently, the sc->gfp_mask is set to GFP_KERNEL, which normally means
> that the call may wait on I/O to complete. However, what I'm seeing in
> the bugzilla above is that if kswapd waits on an RPC call, then the
> whole VM may gum up: typically, the traces show that the socket layer
> cannot allocate memory to hold the RPC reply from the server, and so it
> is kicking kswapd to have it reclaim some pages, however kswapd is stuck
> in try_to_release_page() waiting for that same I/O to complete, hence
> the deadlock...

Ah, I see. So, as far as I understand, you mean:
 - The socket layer uses GFP_ATOMIC, so it never calls try_to_free_pages().
   IOW, kswapd is the only thread reclaiming memory.
 - Kswapd gets stuck in ->releasepage().
 - In the usual case, another thread calling kmalloc(GFP_KERNEL) would enter
   foreground reclaim and get kswapd unstuck, but in your case there is no
   such thread.

Hm, interesting.

In the short term, the current NFS fix (checking PF_MEMALLOC in nfs_wb_page())
seems to be the best way; it has no side effects if my understanding is
correct.
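
A rough sketch of the kind of check being referred to (illustrative only,
not the exact NFS commit; its placement in the NFS code may differ):

	/* refuse to wait for writeback when called from reclaim context */
	if (current->flags & PF_MEMALLOC)
		return 0;	/* report "cannot release", do not block */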


> IOW: I think kswapd at least should be calling try_to_release_page()
> with a gfp-flag of '0' to avoid deadlocking on I/O.

Hmmm.
A gfp-flag of 0 has a much stronger meaning than NFS requires.
There is no reason to prevent grabbing a mutex, calling cond_resched(), etc...

[digging old git history]

Ho hum...

The old commit log says passing gfp-flag=0 broke XFS, but current XFS doesn't
use the gfp_mask argument. Hm.


============================================================
commit 68678e2fc6cfdfd013a2513fe416726f3c05b28d
Author: akpm <akpm>
Date:   Tue Sep 10 18:09:08 2002 +0000

    [PATCH] pass the correct flags to aops->releasepage()

    Restore the gfp_mask in the VM's call to a_ops->releasepage().  We can
    block in there again, and XFS (at least) can use that.

    BKrev: 3d7e35445skDsKDFM6rdiwTY-5elsw

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ed1ec3..89d801e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,7 +208,7 @@ shrink_list(struct list_head *page_list, int nr_pages,
                 * Otherwise, leave the page on the LRU so it is swappable.
                 */
                if (PagePrivate(page)) {
-                       if (!try_to_release_page(page, 0))
+                       if (!try_to_release_page(page, gfp_mask))
                                goto keep_locked;
                        if (!mapping && page_count(page) == 1)
                                goto free_it;
============================================================

Now, the gfp_mask of try_to_release_page() is used in two places:

btrfs: btrfs_releasepage		(checks __GFP_WAIT)
nfs:   nfs_release_page			((gfp & GFP_KERNEL) == GFP_KERNEL)

Probably btrfs could remove that __GFP_WAIT check from
try_release_extent_mapping() because it doesn't sleep. I don't know. If so,
we could change the flag back to 0 again, but I'm not sure it is worth it.

Chris, can you tell us how btrfs handles the gfp_mask argument of
->releasepage()?
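
For reference, a rough from-memory sketch of the two checks listed above
(simplified and paraphrased, not the verbatim sources):

	/* btrfs, in try_release_extent_mapping(): only attempt the
	 * potentially expensive extent-map drop when waiting is allowed */
	if (mask & __GFP_WAIT) {
		/* ... walk and drop cached extent mappings ... */
	}

	/* nfs, in nfs_release_page(): only start I/O when a full
	 * GFP_KERNEL context is permitted */
	if ((gfp & GFP_KERNEL) == GFP_KERNEL) {
		/* ... flush the page before trying to release it ... */
	}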



Btw, the VM folks need to think more about the kswapd design. Kswapd now
sleeps fairly often, but Trond's bug report says that the waiting itself can
potentially cause a deadlock. Perhaps it's only an imagined problem, but it
needs some consideration...





^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-31 10:33       ` Mel Gorman
@ 2010-08-02 18:31         ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2010-08-02 18:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Sat 31-07-10 11:33:22, Mel Gorman wrote:
> On Fri, Jul 30, 2010 at 03:06:01PM -0700, Andrew Morton wrote:
> > Sigh.  We have sooo many problems with writeback and latency.  Read
> > https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.
> 
> You aren't joking.
> 
> > Everyone's
> > running away from the issue and here we are adding code to solve some
> > alleged stack-overflow problem which seems to be largely a non-problem,
> > by making changes which may worsen our real problems.
> > 
> 
> As it is, filesystems are beginning to ignore writeback from direct
> reclaim - such as xfs and btrfs. I'm led to believe that ext3
> effectively ignores writeback from direct reclaim although I don't have
> access to the code at the moment to double check (am on the road). So either
> way, we are going to be facing this problem so the VM might as well be
> aware of it :/
  Umm, ext3 should be handling direct reclaim just fine. ext4 does, however,
ignore it when a page does not have a block already allocated (which is a
common case with delayed allocation).
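
A simplified sketch of that ext4 behaviour (the helpers are made up for
illustration; this is not the actual ext4_writepage() code):

	static int writepage_like_ext4(struct page *page,
				       struct writeback_control *wbc)
	{
		/*
		 * A delayed-allocation page still needs blocks allocated;
		 * refuse to do that from reclaim context and leave the
		 * page dirty for the flusher threads instead.
		 */
		if (page_needs_block_allocation(page)) {  /* made-up helper */
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 0;
		}

		return write_the_page(page, wbc);	  /* made-up helper */
	}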

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-07-30 13:37   ` Mel Gorman
@ 2010-08-05  6:45     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2010-08-05  6:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli


sorry for the _very_ delayed review.

> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
>   o When dirtying pages, processes may be throttled to clean pages if
>     dirty_ratio is not met.
>   o Pages belonging to inodes dirtied longer than
>     dirty_writeback_centisecs get cleaned.
> 
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly so that neither the throttling or a flusher
> thread waking periodically cleans them.
> 
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
> 
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   33 +++++++++++++++++++++++++++++++--
>  1 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d2b588..c4c81bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -142,6 +142,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  /* Direct lumpy reclaim waits up to five seconds for background cleaning */
>  #define MAX_SWAP_CLEAN_WAIT 50
>  
> +/*
> + * When reclaim encounters dirty data, wakeup flusher threads to clean
> + * a maximum of 4M of data.
> + */
> +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> +static inline long nr_writeback_pages(unsigned long nr_dirty)
> +{
> +	return laptop_mode ? 0 :
> +			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> +}

??

As far as I remember, Hannes pointed out that wakeup_flusher_threads(0) is
incorrect. Can you fix this?



> +
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
>  						  struct scan_control *sc)
>  {
> @@ -649,12 +661,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
>  					enum pageout_io sync_writeback,
> +					int file,
>  					unsigned long *nr_still_dirty)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
>  	unsigned long nr_dirty = 0;
> +	unsigned long nr_dirty_seen = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -748,6 +762,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (PageDirty(page)) {
> +			nr_dirty_seen++;
> +
>  			/*
>  			 * Only kswapd can writeback filesystem pages to
>  			 * avoid risk of stack overflow
> @@ -875,6 +891,18 @@ keep:
>  
>  	list_splice(&ret_pages, page_list);
>  
> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though the
> +	 * dirty_ratio may be satisified. In this case, wake flusher
> +	 * threads to pro-actively clean up to a maximum of
> +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> +	 * !may_writepage indicates that this is a direct reclaimer in
> +	 * laptop mode avoiding disk spin-ups
> +	 */
> +	if (file && nr_dirty_seen && sc->may_writepage)
> +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));

Umm..
I don't think this guess is very accurate. The following is an abbreviated
version of the current isolate_lru_pages():


static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
                struct list_head *src, struct list_head *dst,
                unsigned long *scanned, int order, int mode, int file)
{
        for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		__isolate_lru_page(page, mode, file))

                if (!order)
                        continue;

                /*
                 * Attempt to take all pages in the order aligned region
                 * surrounding the tag page.  Only take those pages of
                 * the same active state as that tag page.  We may safely
                 * round the target page pfn down to the requested order
                 * as the mem_map is guarenteed valid out to MAX_ORDER,
                 * where that page is in a different zone we will detect
                 * it from its zone id and abort this block scan.
                 */
                for (; pfn < end_pfn; pfn++) {
                        struct page *cursor_page;
			(snip)
		}

(This has been unchanged since the initial lumpy reclaim commit.)

That said, even an order-1 isolate_lru_pages(ISOLATE_INACTIVE) does a pfn
neighbour search, so we might find dirty pages even though those pages are
not at the end of the LRU.

What do you think?


> +
>  	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
> @@ -1315,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	spin_unlock_irq(&zone->lru_lock);
>  
>  	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> -								&nr_dirty);
> +							file, &nr_dirty);
>  
>  	/*
>  	 * If specific pages are needed such as with direct reclaiming
> @@ -1351,7 +1379,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  			count_vm_events(PGDEACTIVATE, nr_active);
>  
>  			nr_reclaimed += shrink_page_list(&page_list, sc,
> -						PAGEOUT_IO_SYNC, &nr_dirty);
> +						PAGEOUT_IO_SYNC, file,
> +						&nr_dirty);
>  		}
>  	}
>  
> -- 
> 1.7.1
> 




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 5/6] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-07-30 13:36   ` Mel Gorman
@ 2010-08-05  6:59     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2010-08-05  6:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli


again, very sorry for the delay.

> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++------------
>  1 files changed, 54 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d83812a..2d2b588 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #define scanning_global_lru(sc)	(1)
>  #endif
>  
> +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
>  						  struct scan_control *sc)
>  {
> @@ -645,11 +648,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
> -					enum pageout_io sync_writeback)
> +					enum pageout_io sync_writeback,
> +					unsigned long *nr_still_dirty)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
> +	unsigned long nr_dirty = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -743,6 +748,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (PageDirty(page)) {
> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> +				nr_dirty++;
> +				goto keep_locked;
> +			}
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -860,6 +874,8 @@ keep:
>  	free_page_list(&free_pages);
>  
>  	list_splice(&ret_pages, page_list);
> +
> +	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
>  }
> @@ -1242,12 +1258,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  			struct scan_control *sc, int priority, int file)
>  {
>  	LIST_HEAD(page_list);
> +	LIST_HEAD(putback_list);
>  	unsigned long nr_scanned;
>  	unsigned long nr_reclaimed = 0;
>  	unsigned long nr_taken;
>  	unsigned long nr_active;
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
> +	unsigned long nr_dirty;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1296,28 +1314,49 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> +								&nr_dirty);
>  
>  	/*
> -	 * If we are direct reclaiming for contiguous pages and we do
> +	 * If specific pages are needed such as with direct reclaiming
> +	 * for contiguous pages or for memory containers and we do
>  	 * not reclaim everything in the list, try again and wait
> -	 * for IO to complete. This will stall high-order allocations
> -	 * but that should be acceptable to the caller
> +	 * for IO to complete. This will stall callers that require
> +	 * specific pages but it should be acceptable to the caller
>  	 */
> -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> -			sc->lumpy_reclaim_mode) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	if (sc->may_writepage && !current_is_kswapd() &&
> +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
>  
> -		/*
> -		 * The attempt at page out may have made some
> -		 * of the pages active, mark them inactive again.
> -		 */
> -		nr_active = clear_active_flags(&page_list, NULL);
> -		count_vm_events(PGDEACTIVATE, nr_active);
> +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +			struct page *page, *tmp;
> +
> +			/* Take off the clean pages marked for activation */
> +			list_for_each_entry_safe(page, tmp, &page_list, lru) {
> +				if (PageDirty(page) || PageWriteback(page))
> +					continue;
> +
> +				list_del(&page->lru);
> +				list_add(&page->lru, &putback_list);
> +			}
> +
> +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);

Ditto: isn't wakeup_flusher_threads(0) incorrect?

Also, when the flusher thread hasn't started the IO yet, this loop has no
proper waiting. Do we need a wait_on_page_dirty() or something similar to
wait_on_page_writeback()?



> +			congestion_wait(BLK_RW_ASYNC, HZ/10);

As we discussed, congestion_wait() doesn't work well if a slow storage device
is connected.


>  
> -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> +			/*
> +			 * The attempt at page out may have made some
> +			 * of the pages active, mark them inactive again.
> +			 */
> +			nr_active = clear_active_flags(&page_list, NULL);
> +			count_vm_events(PGDEACTIVATE, nr_active);
> +
> +			nr_reclaimed += shrink_page_list(&page_list, sc,
> +						PAGEOUT_IO_SYNC, &nr_dirty);

After my patch, retrying after a PAGEOUT_IO_SYNC failure is not a good idea.
Can we remove this loop?


> +		}
>  	}
>  
> +	list_splice(&putback_list, &page_list);
> +
>  	local_irq_disable();
>  	if (current_is_kswapd())
>  		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> -- 
> 1.7.1
> 




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-08-05  6:45     ` KOSAKI Motohiro
@ 2010-08-05 14:09       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-08-05 14:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

On Thu, Aug 05, 2010 at 03:45:24PM +0900, KOSAKI Motohiro wrote:
> 
> sorry for the _very_ delayed review.
> 

Not to worry.

> > <SNIP>
> > +/*
> > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > + * a maximum of 4M of data.
> > + */
> > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > +{
> > +	return laptop_mode ? 0 :
> > +			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > +}
> 
> ??
> 
> As far as I remember, Hannes pointed out that wakeup_flusher_threads(0) is
> incorrect. Can you fix this?
> 

It's behaving as it should, see http://lkml.org/lkml/2010/7/20/151
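
Paraphrasing the behaviour discussed there (not the verbatim
fs/fs-writeback.c code): a nr_pages argument of 0 is a request to write back
everything that is currently dirty, which is what laptop_mode wants once the
disk has been spun up anyway. Roughly:

	if (nr_pages == 0)
		nr_pages = global_page_state(NR_FILE_DIRTY) +
			   global_page_state(NR_UNSTABLE_NFS);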

> 
> 
> > +
> >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> >  						  struct scan_control *sc)
> >  {
> > @@ -649,12 +661,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> >  					enum pageout_io sync_writeback,
> > +					int file,
> >  					unsigned long *nr_still_dirty)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> >  	unsigned long nr_dirty = 0;
> > +	unsigned long nr_dirty_seen = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -748,6 +762,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		}
> >  
> >  		if (PageDirty(page)) {
> > +			nr_dirty_seen++;
> > +
> >  			/*
> >  			 * Only kswapd can writeback filesystem pages to
> >  			 * avoid risk of stack overflow
> > @@ -875,6 +891,18 @@ keep:
> >  
> >  	list_splice(&ret_pages, page_list);
> >  
> > +	/*
> > +	 * If reclaim is encountering dirty pages, it may be because
> > +	 * dirty pages are reaching the end of the LRU even though the
> > +	 * dirty_ratio may be satisified. In this case, wake flusher
> > +	 * threads to pro-actively clean up to a maximum of
> > +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > +	 * !may_writepage indicates that this is a direct reclaimer in
> > +	 * laptop mode avoiding disk spin-ups
> > +	 */
> > +	if (file && nr_dirty_seen && sc->may_writepage)
> > +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> 
> Umm..
> I don't think this guess is very accurate. The following is an abbreviated
> version of the current isolate_lru_pages():
> 
> 
> static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                 struct list_head *src, struct list_head *dst,
>                 unsigned long *scanned, int order, int mode, int file)
> {
>         for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> 		__isolate_lru_page(page, mode, file))
> 
>                 if (!order)
>                         continue;
> 
>                 /*
>                  * Attempt to take all pages in the order aligned region
>                  * surrounding the tag page.  Only take those pages of
>                  * the same active state as that tag page.  We may safely
>                  * round the target page pfn down to the requested order
>                  * as the mem_map is guarenteed valid out to MAX_ORDER,
>                  * where that page is in a different zone we will detect
>                  * it from its zone id and abort this block scan.
>                  */
>                 for (; pfn < end_pfn; pfn++) {
>                         struct page *cursor_page;
> 			(snip)
> 		}
> 
> (This was unchanged since initial lumpy reclaim commit)
> 

I think what you are pointing out is that when lumpy-reclaiming from the anon
LRU, there may be file pages on the page_list being shrunk. In that case, we
might miss an opportunity to wake the flusher threads when it would be
appropriate.

Is that accurate, or do you have another concern?

> That said, even an order-1 isolate_lru_pages(ISOLATE_INACTIVE) does a pfn
> neighbour search, so we might find dirty pages even though those pages are
> not at the end of the LRU.
> 
> What do you think?
> 

For low-order lumpy reclaim, I think it should only be necessary to wake
the flusher threads when scanning the file LRU. While there may be file
pages lumpy reclaimed while scanning the anon list, I think we would
have to show it was a common and real problem before adding the
necessary accounting and checks.

> 
> > +
> >  	*nr_still_dirty = nr_dirty;
> >  	count_vm_events(PGACTIVATE, pgactivate);
> >  	return nr_reclaimed;
> > @@ -1315,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> >  	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > -								&nr_dirty);
> > +							file, &nr_dirty);
> >  
> >  	/*
> >  	 * If specific pages are needed such as with direct reclaiming
> > @@ -1351,7 +1379,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  			count_vm_events(PGDEACTIVATE, nr_active);
> >  
> >  			nr_reclaimed += shrink_page_list(&page_list, sc,
> > -						PAGEOUT_IO_SYNC, &nr_dirty);
> > +						PAGEOUT_IO_SYNC, file,
> > +						&nr_dirty);
> >  		}
> >  	}
> >  
> > -- 
> > 1.7.1
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 5/6] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-08-05  6:59     ` KOSAKI Motohiro
@ 2010-08-05 14:15       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2010-08-05 14:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, Wu Fengguang,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

On Thu, Aug 05, 2010 at 03:59:37PM +0900, KOSAKI Motohiro wrote:
> 
> again, very sorry for the delay.
> 

No problem.

> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++------------
> >  1 files changed, 54 insertions(+), 15 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d83812a..2d2b588 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #define scanning_global_lru(sc)	(1)
> >  #endif
> >  
> > +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> >  						  struct scan_control *sc)
> >  {
> > @@ -645,11 +648,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> > -					enum pageout_io sync_writeback)
> > +					enum pageout_io sync_writeback,
> > +					unsigned long *nr_still_dirty)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> > +	unsigned long nr_dirty = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -743,6 +748,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		}
> >  
> >  		if (PageDirty(page)) {
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				nr_dirty++;
> > +				goto keep_locked;
> > +			}
> > +
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -860,6 +874,8 @@ keep:
> >  	free_page_list(&free_pages);
> >  
> >  	list_splice(&ret_pages, page_list);
> > +
> > +	*nr_still_dirty = nr_dirty;
> >  	count_vm_events(PGACTIVATE, pgactivate);
> >  	return nr_reclaimed;
> >  }
> > @@ -1242,12 +1258,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  			struct scan_control *sc, int priority, int file)
> >  {
> >  	LIST_HEAD(page_list);
> > +	LIST_HEAD(putback_list);
> >  	unsigned long nr_scanned;
> >  	unsigned long nr_reclaimed = 0;
> >  	unsigned long nr_taken;
> >  	unsigned long nr_active;
> >  	unsigned long nr_anon;
> >  	unsigned long nr_file;
> > +	unsigned long nr_dirty;
> >  
> >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1296,28 +1314,49 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > +								&nr_dirty);
> >  
> >  	/*
> > -	 * If we are direct reclaiming for contiguous pages and we do
> > +	 * If specific pages are needed such as with direct reclaiming
> > +	 * for contiguous pages or for memory containers and we do
> >  	 * not reclaim everything in the list, try again and wait
> > -	 * for IO to complete. This will stall high-order allocations
> > -	 * but that should be acceptable to the caller
> > +	 * for IO to complete. This will stall callers that require
> > +	 * specific pages but it should be acceptable to the caller
> >  	 */
> > -	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > -			sc->lumpy_reclaim_mode) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	if (sc->may_writepage && !current_is_kswapd() &&
> > +			(sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >  
> > -		/*
> > -		 * The attempt at page out may have made some
> > -		 * of the pages active, mark them inactive again.
> > -		 */
> > -		nr_active = clear_active_flags(&page_list, NULL);
> > -		count_vm_events(PGDEACTIVATE, nr_active);
> > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +			struct page *page, *tmp;
> > +
> > +			/* Take off the clean pages marked for activation */
> > +			list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > +				if (PageDirty(page) || PageWriteback(page))
> > +					continue;
> > +
> > +				list_del(&page->lru);
> > +				list_add(&page->lru, &putback_list);
> > +			}
> > +
> > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> 
> ditto.
> wakeup_flusher_threads(0) is not correct?
> 

It's correct. When in lumpy mode, clean everything if the disk has to
spin up.

> And, When flusher thread still don't start IO, this loop don't have proper
> waiting. do we need wait_on_page_dirty() or something?
> (similar wait_on_page_writeback)
> 

If IO is not started on the correct pages, the flusher threads will be
rekicked for more work and another attempt is made at shrink_page_list.

> 
> 
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> As we discussed, congestion_wait() don't works find if slow strage device
> is connected.
> 

I currently support the removal of this congestion_wait(), but it belongs
in its own patch.

> 
> >  
> > -		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > +			/*
> > +			 * The attempt at page out may have made some
> > +			 * of the pages active, mark them inactive again.
> > +			 */
> > +			nr_active = clear_active_flags(&page_list, NULL);
> > +			count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > +			nr_reclaimed += shrink_page_list(&page_list, sc,
> > +						PAGEOUT_IO_SYNC, &nr_dirty);
> 
> After my patch, when PAGEOUT_IO_SYNC failure, retry is no good idea.
> can we remove this loop?
> 

Such a removal belongs in the series aimed at lowering the latency of lumpy
reclaim. This patch is just about preventing dirty file pages from being
written back by direct reclaim.

> 
> > +		}
> >  	}
> >  
> > +	list_splice(&putback_list, &page_list);
> > +
> >  	local_irq_disable();
> >  	if (current_is_kswapd())
> >  		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> > -- 
> > 1.7.1
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* perf scripting
  2010-07-30 14:04     ` Frederic Weisbecker
  (?)
  (?)
@ 2010-08-14 20:04     ` Christoph Hellwig
  2010-09-16 12:08       ` Frederic Weisbecker
  -1 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-08-14 20:04 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: linux-kernel

On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
> I have the feeling you've made an ad-hoc post processing script that seems
> to rewrite all the format parsing, debugfs, stream handling, etc... we
> have that in perf tools already.
> 
> May be you weren't aware of what we have in perf in terms of scripting support.

Frederic, any chance you could help me get a bit more familiar with
the perf perl scripting?  I currently have a hacky little sequence that
I use to profile which callers generate XFS log traffic, and I'd like to
turn it into a script so that I can do a direct perf call to use it
to profile things without manual work, and generate nicer output.

Currently it looks like this:

perf probe --add xlog_sync

perf record -g -e probe:xlog_sync -a -- <insert actual workload here>

then do

perf report -n -g flat

to get me the callchain in a readable format.

Now what I'd really like is a perl script that can read a file like
latencytop.trans (or just has the information embedded) which contains
functions in the backtrace that we're interested in.

E.g. one sample from the report command above may look like:

                xlog_sync
		xlog_write
		xlog_cil_push
		_xfs_log_force
		xfs_log_force
		xfs_sync_data
		xfs_quiesce_data
		xfs_fs_sync_fs

In which case I'm interested in xfs_log_force and xfs_fs_sync_fs.  So
the output of the perl script should look something like:


  Samples	Caller
	2	xfs_fs_sync_fs
	1	xfs_file_fsync
	1	xfs_commit_dummy_trans

Or if I have a way to parse the argument of the probe (in the worst case
I can replace it with a trace event if that makes it easier):

  Samples	Flags		Callers
	1	sync		xfs_fs_sync_fs
	1			xfs_fs_sync_fs
	1	sync		xfs_file_fsync
	1	sync		xfs_commit_dummy_trans


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: perf scripting
  2010-08-14 20:04     ` perf scripting Christoph Hellwig
@ 2010-09-16 12:08       ` Frederic Weisbecker
  2010-09-17 10:32         ` Masami Hiramatsu
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2010-09-16 12:08 UTC (permalink / raw)
  To: Christoph Hellwig, Tom Zanussi
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt

(Sorry to answer that so late)


On Sat, Aug 14, 2010 at 04:04:15PM -0400, Christoph Hellwig wrote:
> On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
> > I have the feeling you've made an ad-hoc post processing script that seems
> > to rewrite all the format parsing, debugfs, stream handling, etc... we
> > have that in perf tools already.
> > 
> > May be you weren't aware of what we have in perf in terms of scripting support.
> 
> Frederic, any chance you could help me getting a bit more familar with
> the perf perl scripting.  I currently have a hacky little sequence that
> I use to profile what callers generate XFS log traffic, and it like to
> turn it into a script so that I can do a direct perf call to use it
> to profile things without manual work, and generate nicer output.
> 
> Currently it looks like this:
> 
> perf probe --add xlog_sync
> 
> perf record -g -e probe:xlog_sync -a -- <insert actualy workload here>
> 
> then do
> 
> perf report -n -g flat
> 
> to get me the callchain in a readable format.
> 
> Now what I'd really like is a perl script that can read a file like
> latencytop.trans (or just has the information embedded) which contains
> functions in the backtrace that we're interested in.
> 
> E.g. one simple from the report command above may look like:
> 
>                 xlog_sync
> 		xlog_write
> 		xlog_cil_push
> 		_xfs_log_force
> 		xfs_log_force
> 		xfs_sync_data
> 		xfs_quiesce_data
> 		xfs_fs_sync_fs
> 
> In which case I'm interested in xfs_log_force and xfs_fs_sync_fs.  So
> the output of the perl script should looks something like:
> 
> 
>   Samples	Caller
> 	2	xfs_fs_sync_fs
> 	1	xfs_file_fsync
> 	1	xfs_commit_dummy_trans


That's roughly the kind of overview you can already get with
perf report, using the default fractal mode or the graph mode.
Callers are sorted by hits in these modes (and in raw mode too).

But it could be interesting to add the callchains as arguments to the
perl/python scripting handlers for more precise use cases.
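
For illustration only (no such argument exists in perf today), the
generated handler might then simply receive the resolved callchain as a
trailing argument:

sub probe::xlog_sync
{
    my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
        $common_pid, $common_comm,
        $callchain) = @_;   # hypothetical: array ref of resolved symbols

    # The script could then walk @$callchain and aggregate whatever it
    # wants, e.g. hits per interesting caller.
}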

 
> Or if I have a way to parse the argument of the probe (in the worst case
> I can replace it with a trace event if that makes it easier):
> 
>   Samples	Flags		Callers
> 	1	sync		xfs_fs_sync_fs
> 	1			xfs_fs_sync_fs
> 	1	sync		xfs_file_fsync
> 	1	sync		xfs_commit_dummy_trans


So, for example, that becomes an even more precise use case.
Currently the perf scripting engine doesn't give you access
to the callchains of a trace sample. That would be a nice feature
and would solve your problem.

Tom, what do you think about that? This could be a special mode
requested by the user, or something done automatically if callchains
are present in the samples. We could add a specific callchain extra
argument to the generated scripting handlers, or it could be a
generic extra dict argument containing whatever extra data the user
might request (perf sample headers, etc.).

What do you think?

Thanks.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: perf scripting
  2010-09-16 12:08       ` Frederic Weisbecker
@ 2010-09-17 10:32         ` Masami Hiramatsu
  2010-09-18  5:04           ` Tom Zanussi
  0 siblings, 1 reply; 58+ messages in thread
From: Masami Hiramatsu @ 2010-09-17 10:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Hellwig, Tom Zanussi, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, 2nddept-manager

(2010/09/16 21:08), Frederic Weisbecker wrote:
> (Sorry to answer that so late)
> 
> 
> On Sat, Aug 14, 2010 at 04:04:15PM -0400, Christoph Hellwig wrote:
>> On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
>>> I have the feeling you've made an ad-hoc post processing script that seems
>>> to rewrite all the format parsing, debugfs, stream handling, etc... we
>>> have that in perf tools already.
>>>
>>> May be you weren't aware of what we have in perf in terms of scripting support.
>>
>> Frederic, any chance you could help me getting a bit more familar with
>> the perf perl scripting.  I currently have a hacky little sequence that
>> I use to profile what callers generate XFS log traffic, and it like to
>> turn it into a script so that I can do a direct perf call to use it
>> to profile things without manual work, and generate nicer output.
>>
>> Currently it looks like this:
>>
>> perf probe --add xlog_sync
>>
>> perf record -g -e probe:xlog_sync -a -- <insert actualy workload here>
>>
>> then do
>>
>> perf report -n -g flat
>>
>> to get me the callchain in a readable format.
>>
>> Now what I'd really like is a perl script that can read a file like
>> latencytop.trans (or just has the information embedded) which contains
>> functions in the backtrace that we're interested in.
>>
>> E.g. one simple from the report command above may look like:
>>
>>                 xlog_sync
>> 		xlog_write
>> 		xlog_cil_push
>> 		_xfs_log_force
>> 		xfs_log_force
>> 		xfs_sync_data
>> 		xfs_quiesce_data
>> 		xfs_fs_sync_fs
>>
>> In which case I'm interested in xfs_log_force and xfs_fs_sync_fs.  So
>> the output of the perl script should looks something like:
>>
>>
>>   Samples	Caller
>> 	2	xfs_fs_sync_fs
>> 	1	xfs_file_fsync
>> 	1	xfs_commit_dummy_trans

BTW, if you want the caller for each call, you can do it with perf probe:

# perf probe --add 'xlog_sync caller=+0($stack)'

Then you can see the caller address in the caller argument of the
xlog_sync event record.


> Somehow, that's a kind of overview you can get with
> perf report, using the default fractal mode or the graph mode.
> Callers are sorted by hits in these modes (actually in raw mode too).
> 
> But it could be interesting to add the callchains as arguments to the
> perl/python scripting handlers for precise usecases.
> 
>  
>> Or if I have a way to parse the argument of the probe (in the worst case
>> I can replace it with a trace event if that makes it easier):
>>
>>   Samples	Flags		Callers
>> 	1	sync		xfs_fs_sync_fs
>> 	1			xfs_fs_sync_fs
>> 	1	sync		xfs_file_fsync
>> 	1	sync		xfs_commit_dummy_trans
> 
> 
> So for example that becomes an even more precise usecase.
> Currently the perf scripting engine doesn't give you access
> to the callchains of a trace sample. That would be a nice feature
> and would solve your problem.

AFAIK, the perf perl scripting support already lets you get at the
arguments of events, e.g.

our %caller_list;

# probe:xlog_sync handler; the probe's caller argument arrives after
# the common fields.
sub probe::xlog_sync
{
    my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
        $common_pid, $common_comm,
        $caller) = @_;

    $caller_list{$caller} = 0 if (!defined($caller_list{$caller}));
    $caller_list{$caller}++;
}

to count hits per caller address.
(However, the perf perl support currently has no address-to-symbol
translation function.)
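
A trace_end handler along these lines (a sketch only; the raw addresses
are printed as-is, since there is no symbol translation) could dump the
result:

sub trace_end
{
    # Print callers sorted by number of hits.
    foreach my $caller (sort { $caller_list{$b} <=> $caller_list{$a} }
                        keys %caller_list) {
        printf("%8d  %s\n", $caller_list{$caller}, $caller);
    }
}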

If perf scripting supported calling perf internally to define
new events for the script, that would be useful from the viewpoint
of script packaging.

Thank you,

> 
> Tom, what do you think about that? This could be a special mode
> requested by the user, or something made automatically if callchains
> are present in samples. We could add a specific callchain extra
> argument to the generated scripting handlers, or this could
> be a generic extra dict argument that can contain whatever we want
> (perf sample headers, etc...), whatever extra data the user might
> request.
> 
> What do you think?
> 
> Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: perf scripting
  2010-09-17 10:32         ` Masami Hiramatsu
@ 2010-09-18  5:04           ` Tom Zanussi
  0 siblings, 0 replies; 58+ messages in thread
From: Tom Zanussi @ 2010-09-18  5:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Frederic Weisbecker, Christoph Hellwig, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, 2nddept-manager

Hi,

On Fri, 2010-09-17 at 19:32 +0900, Masami Hiramatsu wrote:
> (2010/09/16 21:08), Frederic Weisbecker wrote:
> > (Sorry to answer that so late)
> > 
> > 
> > On Sat, Aug 14, 2010 at 04:04:15PM -0400, Christoph Hellwig wrote:
> >> On Fri, Jul 30, 2010 at 04:04:42PM +0200, Frederic Weisbecker wrote:
> >>> I have the feeling you've made an ad-hoc post processing script that seems
> >>> to rewrite all the format parsing, debugfs, stream handling, etc... we
> >>> have that in perf tools already.
> >>>
> >>> May be you weren't aware of what we have in perf in terms of scripting support.
> >>
> >> Frederic, any chance you could help me getting a bit more familar with
> >> the perf perl scripting.  I currently have a hacky little sequence that
> >> I use to profile what callers generate XFS log traffic, and it like to
> >> turn it into a script so that I can do a direct perf call to use it
> >> to profile things without manual work, and generate nicer output.
> >>
> >> Currently it looks like this:
> >>
> >> perf probe --add xlog_sync
> >>
> >> perf record -g -e probe:xlog_sync -a -- <insert actualy workload here>
> >>
> >> then do
> >>
> >> perf report -n -g flat
> >>
> >> to get me the callchain in a readable format.
> >>
> >> Now what I'd really like is a perl script that can read a file like
> >> latencytop.trans (or just has the information embedded) which contains
> >> functions in the backtrace that we're interested in.
> >>
> >> E.g. one simple from the report command above may look like:
> >>
> >>                 xlog_sync
> >> 		xlog_write
> >> 		xlog_cil_push
> >> 		_xfs_log_force
> >> 		xfs_log_force
> >> 		xfs_sync_data
> >> 		xfs_quiesce_data
> >> 		xfs_fs_sync_fs
> >>
> >> In which case I'm interested in xfs_log_force and xfs_fs_sync_fs.  So
> >> the output of the perl script should looks something like:
> >>
> >>
> >>   Samples	Caller
> >> 	2	xfs_fs_sync_fs
> >> 	1	xfs_file_fsync
> >> 	1	xfs_commit_dummy_trans
> 
> BTW, if you want the caller for each call, you can do with perf probe
> 
> # perf probe --add 'xlog_sync caller=+0($stack)'
> 
> then, you can see the caller address in caller argument of
> xlog_sync event record.
> 
> 
> > Somehow, that's a kind of overview you can get with
> > perf report, using the default fractal mode or the graph mode.
> > Callers are sorted by hits in these modes (actually in raw mode too).
> > 
> > But it could be interesting to add the callchains as arguments to the
> > perl/python scripting handlers for precise usecases.
> > 
> >  
> >> Or if I have a way to parse the argument of the probe (in the worst case
> >> I can replace it with a trace event if that makes it easier):
> >>
> >>   Samples	Flags		Callers
> >> 	1	sync		xfs_fs_sync_fs
> >> 	1			xfs_fs_sync_fs
> >> 	1	sync		xfs_file_fsync
> >> 	1	sync		xfs_commit_dummy_trans
> > 
> > 
> > So for example that becomes an even more precise usecase.
> > Currently the perf scripting engine doesn't give you access
> > to the callchains of a trace sample. That would be a nice feature
> > and would solve your problem.
> 
> AFAIK, perf perl script already supports getting arguments of
> events. e.g.
> 
> sub probe::xlog_sync
> {
>     my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
>         $common_pid, $common_comm,
>         $caller) = @_;
> 
>     if (!defined($caller_list{$caller})) {
>          	$caller_list{$caller} = 0;
>     }	
>     $caller_list{$caller}++;
> }
> 
> for count up caller address.
> (However, perf perl currently doesn't have address-symbol translation
>  function. )
> 
> If perf scripting supports calling perf internally for defining
> new events for the script, it will be useful (from the viewpoint
> of script packaging).
> 
> Thank you,
> 
> > 
> > Tom, what do you think about that? This could be a special mode
> > requested by the user, or something made automatically if callchains
> > are present in samples. We could add a specific callchain extra
> > argument to the generated scripting handlers, or this could
> > be a generic extra dict argument that can contain whatever we want
> > (perf sample headers, etc...), whatever extra data the user might
> > request.
> > 
> > What do you think?
> > 

Rather than adding extra arguments to the handlers, how about providing
functions to get these, similar to how you can already get the 'less
common' fields of the event such as common_pc(), common_lock_depth(),
etc.  You call these during the event processing loop, giving them the
$context passed into the handler, and they in turn call perf internally
to get the data for the event.  This method can be used to get pretty
much anything out of perf internally.

So initially for callchains and callers we could have a couple functions
like:

get_callchain($context)

which would return an array of strings representing the callchain, and

get_caller($context)

which would just return the caller.
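
Usage from a handler might then look roughly like this (a sketch only,
since these helpers are just being proposed here, not implemented; the
%interesting, %hits and %callers hashes are hypothetical script-local
state):

sub probe::xlog_sync
{
    my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
        $common_pid, $common_comm) = @_;

    # Resolve the whole backtrace via the proposed helper and count
    # the first function we consider interesting.
    foreach my $func (get_callchain($context)) {
        next unless $interesting{$func};
        $hits{$func}++;
        last;
    }

    # Or just count the immediate caller.
    $callers{get_caller($context)}++;
}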

It would probably also make sense to have a raw version of at least
get_callchain() that would avoid the overhead of resolving the
callchains until a later time, e.g. you should be able to use the raw
callchains as hash keys and resolve them only when they get dumped
out in the report.  So maybe additionally something like:

get_raw_callchain($context)

which would return an array of u64 and a function to resolve a raw
callchain:

resolve_callchain($context, callchain[])

Or since in this case you need to keep the callchain and the $context
together, they could both just be contained in a callchain_object that
defines its own hash value and can be directly hashed or turned into the
string version by a callchain.tostring() method or something like that.

I don't know how soon I'll be able to implement this since I'm really
busy for the next several weeks, but if it makes sense, I can at least
try to do something for the first two (get_callchain() and
get_caller())...

Tom

> > Thanks.



^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2010-09-18  5:04 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-30 13:36 [PATCH 0/6] Reduce writeback from page reclaim context V6 Mel Gorman
2010-07-30 13:36 ` Mel Gorman
2010-07-30 13:36 ` [PATCH 1/6] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman
2010-07-30 13:36   ` Mel Gorman
2010-07-30 14:04   ` Frederic Weisbecker
2010-07-30 14:04     ` Frederic Weisbecker
2010-07-30 14:12     ` Mel Gorman
2010-07-30 14:12       ` Mel Gorman
2010-07-30 14:15       ` Frederic Weisbecker
2010-07-30 14:15         ` Frederic Weisbecker
2010-08-14 20:04     ` perf scripting Christoph Hellwig
2010-09-16 12:08       ` Frederic Weisbecker
2010-09-17 10:32         ` Masami Hiramatsu
2010-09-18  5:04           ` Tom Zanussi
2010-07-30 13:36 ` [PATCH 2/6] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman
2010-07-30 13:36   ` Mel Gorman
2010-07-30 13:36 ` [PATCH 3/6] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman
2010-07-30 13:36   ` Mel Gorman
2010-07-30 13:36 ` [PATCH 4/6] vmscan: tracing: Correct units in post-processing script Mel Gorman
2010-07-30 13:36   ` Mel Gorman
2010-07-30 13:36 ` [PATCH 5/6] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-07-30 13:36   ` Mel Gorman
2010-08-05  6:59   ` KOSAKI Motohiro
2010-08-05  6:59     ` KOSAKI Motohiro
2010-08-05 14:15     ` Mel Gorman
2010-08-05 14:15       ` Mel Gorman
2010-07-30 13:37 ` [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-07-30 13:37   ` Mel Gorman
2010-07-30 22:06   ` Andrew Morton
2010-07-30 22:06     ` Andrew Morton
2010-07-30 22:40     ` Trond Myklebust
2010-07-30 22:40       ` Trond Myklebust
2010-07-30 22:40       ` Trond Myklebust
2010-08-01  8:19       ` KOSAKI Motohiro
2010-08-01  8:19         ` KOSAKI Motohiro
2010-08-01 16:21         ` Trond Myklebust
2010-08-01 16:21           ` Trond Myklebust
2010-08-01 16:21           ` Trond Myklebust
2010-08-02  7:57           ` KOSAKI Motohiro
2010-08-02  7:57             ` KOSAKI Motohiro
2010-07-31 10:33     ` Mel Gorman
2010-07-31 10:33       ` Mel Gorman
2010-08-02 18:31       ` Jan Kara
2010-08-02 18:31         ` Jan Kara
2010-08-01 11:15     ` Wu Fengguang
2010-08-01 11:15       ` Wu Fengguang
2010-08-01 11:56     ` Wu Fengguang
2010-08-01 11:56       ` Wu Fengguang
2010-08-01 13:03       ` Wu Fengguang
2010-08-01 13:03         ` Wu Fengguang
2010-08-01 13:03         ` Wu Fengguang
     [not found]         ` <80868B70-B17D-4007-AA15-5C11F0F95353@xyke.com>
2010-08-02  2:30           ` Wu Fengguang
2010-08-02  2:30             ` Wu Fengguang
2010-08-02  2:30             ` Wu Fengguang
2010-08-05  6:45   ` KOSAKI Motohiro
2010-08-05  6:45     ` KOSAKI Motohiro
2010-08-05 14:09     ` Mel Gorman
2010-08-05 14:09       ` Mel Gorman
