* [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
@ 2010-06-29 11:34 ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

Here is V3, which again depends on the flusher threads to do writeback for
direct reclaim rather than switching stacks, which is not something I'm
likely to get done before xfs/btrfs are ignoring writeback in mainline
(the PhD is sucking up time). Instead, direct reclaimers that encounter
dirty pages call congestion_wait and, in the case of lumpy reclaim, wait
on the specific pages. A memory pressure test did not show up the
premature OOM problems that some had concerns about.

The details below are long but the short summary is that, on balance, this
patchset appears to behave better than the vanilla kernel: fewer pages are
written back by the VM, high-order allocation under stress performs quite
well and xfs and btrfs both obey writepage again. ext4 still largely
ignores writepage from reclaim context but it would take more significant
legwork to fix that.

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead,
    wake up the flusher threads to clean the number of dirty pages that were
    encountered
 
Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem with the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series aims to trace
writebacks so that it can be evaluated how many dirty pages are being
written, to reduce the stack usage of page reclaim in general and to avoid
direct reclaim writing back pages and overflowing the stack.

The first patch is a fix from Nick Piggin to use lock_page_nosync instead
of lock_page when handling a write error, as the mapping may have vanished
after the page was unlocked.

The next four patches are a forward-port of tracepoints that are partly
based on tracepoints defined by Larry Woodman but never merged. They trace
parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
tracepoints can be used to evaluate what is happening within reclaim and
whether things are getting better or worse. They do not have to be part of
the final series but might be useful during discussion and for later
regression testing - particularly around the percentage of time spent in
reclaim.
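
For anyone who wants to poke at the tracepoints without the post-processing
script, something along these lines (not part of the series; it assumes
debugfs is mounted at /sys/kernel/debug) is enough to enable the vmscan
events around a workload and capture the buffer for later analysis:

#include <stdio.h>
#include <stdlib.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"

/* write a small string to a tracing control file */
static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(int argc, char **argv)
{
	write_file(TRACE_DIR "/trace", "\n");			/* clear the buffer */
	write_file(TRACE_DIR "/events/vmscan/enable", "1");	/* all mm_vmscan_* events */

	if (argc > 1)
		system(argv[1]);	/* the workload to trace, passed as one argument */

	write_file(TRACE_DIR "/events/vmscan/enable", "0");
	system("cat " TRACE_DIR "/trace > vmscan-trace.log");
	/* vmscan-trace.log can then be fed to trace-vmscan-postprocess.pl from patch 4 */
	return 0;
}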

The six patches after that reduce the stack footprint of page reclaim by
moving large allocations out of the main call path. Functionally they
should be equivalent, although there is a timing change in exactly when
pages get freed. This is aimed at giving filesystems as much stack as
possible if kswapd is to write back pages directly.

Patch 12 prevents direct reclaim from writing out pages at all. Instead,
the flusher threads are asked to clean the number of pages encountered and
the caller waits on congestion before putting the pages back on the LRU.
For lumpy reclaim, the caller waits for a time, calling the flusher
multiple times and waiting on dirty pages to be written out, before trying
to reclaim the dirty pages a second time. This increases the responsibility
of kswapd somewhat because it is now cleaning pages on behalf of direct
reclaimers, but kswapd seemed a better fit than the background flushers to
clean pages as it knows where the pages needing cleaning are. As it is
async IO, it should not cause kswapd to stall (at least until the queue is
congested), but the order in which pages are reclaimed from the LRU is
altered: dirty pages that would have been reclaimed by direct reclaimers
get another lap on the LRU. The dirty pages could have been put on a
dedicated list, but this increased counter overhead and the number of
lists, and it is unclear whether it is necessary.

The final two patches revert changes to XFS and btrfs that ignore writeback
from reclaim context.

I ran a number of tests with monitoring on X86, X86-64 and PPC64 and I'll
cover the X86-64 results here. The machine is an AMD Phenom 4-core with 2G
of RAM, a single disk and the onboard IO controller. The dirty ratio was
left at 20 but tests with 40 did not show up any surprises. All of the
tests were run on XFS.

Three kernels are compared.

traceonly-v3r1		is the first 4 patches of this series
stackreduce-v3r1	is the first 12 patches of this series
nodirect-v3r9		is all patches in the series

The results of each test are broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace post-processing script in patch 4 and covers direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and what percentage of time kswapd was awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and the percentage is the stall time over the total time.
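
As a worked example with made-up numbers (the real inputs are the User+Sys
figure from /usr/bin/time and the stall figure from the script):

#include <stdio.h>

int main(void)
{
	double user_sys_sec = 2800.0;	/* User + Sys from /usr/bin/time (hypothetical) */
	double stall_ms = 8000.0;	/* "Time stalled direct reclaim" (hypothetical) */
	double stall_sec = stall_ms / 1000.0;
	double total_sec = user_sys_sec + stall_sec;

	/* prints 0.28% for these hypothetical inputs */
	printf("Percentage Time Spent Direct Reclaim %6.2f%%\n",
	       100.0 * stall_sec / total_sec);
	return 0;
}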

kernbench
=========

                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Elapsed min       98.00 ( 0.00%)    98.26 (-0.27%)    98.11 (-0.11%)
Elapsed mean      98.19 ( 0.00%)    98.35 (-0.16%)    98.22 (-0.03%)
Elapsed stddev     0.16 ( 0.00%)     0.06 (62.07%)     0.11 (32.71%)
Elapsed max       98.44 ( 0.00%)    98.42 ( 0.02%)    98.39 ( 0.05%)
User    min      310.43 ( 0.00%)   311.14 (-0.23%)   311.76 (-0.43%)
User    mean     311.34 ( 0.00%)   311.48 (-0.04%)   312.03 (-0.22%)
User    stddev     0.76 ( 0.00%)     0.34 (55.52%)     0.20 (73.54%)
User    max      312.51 ( 0.00%)   311.86 ( 0.21%)   312.30 ( 0.07%)
System  min       39.62 ( 0.00%)    40.76 (-2.88%)    40.03 (-1.03%)
System  mean      40.44 ( 0.00%)    40.92 (-1.20%)    40.18 ( 0.64%)
System  stddev     0.53 ( 0.00%)     0.20 (62.94%)     0.09 (82.81%)
System  max       41.11 ( 0.00%)    41.25 (-0.34%)    40.28 ( 2.02%)
CPU     min      357.00 ( 0.00%)   358.00 (-0.28%)   358.00 (-0.28%)
CPU     mean     357.75 ( 0.00%)   358.00 (-0.07%)   358.00 (-0.07%)
CPU     stddev     0.43 ( 0.00%)     0.00 (100.00%)     0.00 (100.00%)
CPU     max      358.00 ( 0.00%)   358.00 ( 0.00%)   358.00 ( 0.00%)
FTrace Reclaim Statistics
                                    traceonly-v3r1  stackreduce-v3r1   nodirect-v3r9
Direct reclaims                                  0          0          0 
Direct reclaim pages scanned                     0          0          0 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                             0          0          0 
Kswapd wakeups                                   0          0          0 
Kswapd pages scanned                             0          0          0 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.00       0.00       0.00 
Time kswapd awake (ms)                        0.00       0.00       0.00 

User/Sys Time Running Test (seconds)       2142.97   2145.36   2145.24
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                787.41    787.54    791.01
Percentage Time kswapd Awake                 0.00%     0.00%     0.00%

kernbench is a straightforward kernel compile. The kernel is built 5 times
and an average taken. There was no interesting difference in terms of
performance. As the workload fit easily in memory, there was no page
reclaim activity.

SysBench
========
FTrace Reclaim Statistics
                             traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                                595        948       2113 
Direct reclaim pages scanned                223195     236628     145547 
Direct reclaim write async I/O               39437      26136          0 
Direct reclaim write sync I/O                    0         29          0 
Wake kswapd requests                        703495     948725    1747346 
Kswapd wakeups                                1586       1631       1303 
Kswapd pages scanned                      15216731   16883788   15396343 
Kswapd reclaim write async I/O              877359     961730     235228 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             11.97      25.39      41.78 
Time kswapd awake (ms)                     1702.04    2178.75    2719.17 

User/Sys Time Running Test (seconds)        652.11    684.77    678.44
Percentage Time Spent Direct Reclaim         0.01%     0.00%     0.00%
Total Elapsed Time (seconds)               6033.29   6665.51   6715.48
Percentage Time kswapd Awake                 0.05%     0.00%     0.00%

The results were based on a read/write test and, as the machine is
under-provisioned for this type of test, the figures were very unstable,
with variances up to 15%, so they are not reported. Part of the problem is
that larger thread counts push the test into swap, as the memory is
insufficient, which destabilises the results further. I could tune for
this, but it was reclaim behaviour that was important.

The writing back of dirty pages was a factor. In previous tests, around 2%
of the pages scanned were dirty. In this test, the percentage was higher at
17% but, interestingly, the number of dirty pages encountered was roughly
the same as in previous tests; what changed is the number of pages scanned.
I have no useful theory on this yet other than to note that timing is a
very significant factor when analysing the ratio of dirty to clean pages
encountered. Whether swapping occurred or not at any given time is also
likely a significant factor.

Between 5% and 10% of the pages scanned by kswapd needed to be written back
on these three kernels. As the disk was maxed out all of the time, I'm
having trouble deciding whether this is "too much IO" or not, but I'm
leaning towards "no". I'd expect that the flusher threads were also having
a hard time getting IO bandwidth.

Overall though, based on a number of tests, the performance with and
without the patches is roughly the same, with the main difference being
that direct reclaim is not writing back pages.

Simple Writeback Test
=====================
FTrace Reclaim Statistics
                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                               2301       2382       2682 
Direct reclaim pages scanned                294528     251080     283305 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       1688511    1561619    1718567 
Kswapd wakeups                                1165       1096       1103 
Kswapd pages scanned                      11125671   10993353   11029352 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.01       0.01       0.01 
Time kswapd awake (ms)                      305.16     302.31     293.07 

User/Sys Time Running Test (seconds)        103.92    104.48    104.45
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                636.08    638.11    634.76
Percentage Time kswapd Awake                 0.05%     0.00%     0.00%

This test starts with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes files totalling 4 times RAM to
disk using dd to dirty memory, with conv=fsync to get some sort of
stability in the results.
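
The harness itself is not included in this posting, but each iteration
boils down to something like the sketch below; the RAM size and target
directory are assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const long long ram_mb = 2048;		/* 2G, as on the test machine */
	int threads, i;

	for (threads = 4; threads <= 64; threads *= 2) {
		long long per_file_mb = (4 * ram_mb) / threads;

		/* fork one dd writer per thread so the iteration writes ~4x RAM */
		for (i = 0; i < threads; i++) {
			if (fork() == 0) {
				char cmd[256];

				snprintf(cmd, sizeof(cmd),
					 "dd if=/dev/zero of=/mnt/test/dd-%d-%d "
					 "bs=1M count=%lld conv=fsync",
					 threads, i, per_file_mb);
				_exit(system(cmd));
			}
		}
		while (wait(NULL) > 0)
			;			/* wait for this iteration's writers */
	}
	return 0;
}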

Direct reclaim writeback was not a problem for this test even though a
number of pages were scanned, so there is no reason not to disable it.

Stress HighAlloc
================
                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Pass 1          79.00 ( 0.00%)    77.00 (-2.00%)    74.00 (-5.00%)
Pass 2          80.00 ( 0.00%)    80.00 ( 0.00%)    76.00 (-4.00%)
At Rest         81.00 ( 0.00%)    82.00 ( 1.00%)    83.00 ( 2.00%)

FTrace Reclaim Statistics
                                    traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                               1209       1214       1261 
Direct reclaim pages scanned                177575     141702     163362 
Direct reclaim write async I/O               36877      27783          0 
Direct reclaim write sync I/O                17822      10834          0 
Wake kswapd requests                          4656       1178       4195 
Kswapd wakeups                                 861        854        904 
Kswapd pages scanned                      50583598   33013186   22497685 
Kswapd reclaim write async I/O             3880667    3676997    1363530 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)           7980.65    7836.76    6008.81 
Time kswapd awake (ms)                     5038.21    5329.90    2912.61 

User/Sys Time Running Test (seconds)       2818.55   2827.11   2821.88
Percentage Time Spent Direct Reclaim         0.21%     0.00%     0.00%
Total Elapsed Time (seconds)              10304.58  10151.72   8471.06
Percentage Time kswapd Awake                 0.03%     0.00%     0.00%

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated as huge
pages under load (Pass 1), on a second attempt under load (Pass 2) and when
the kernel compiles have finished and the system is quiet (At Rest).
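
For reference, the huge page side of that metric can be measured with
something as simple as the sketch below; the real harness may differ, and
the 2M huge page size on X86-64 is an assumption:

#include <stdio.h>

int main(void)
{
	const long ram_mb = 2048, hugepage_mb = 2;
	long requested = ram_mb / hugepage_mb, got = 0;
	FILE *f;

	/* ask for enough huge pages to cover RAM ... */
	f = fopen("/proc/sys/vm/nr_hugepages", "w");
	if (f) {
		fprintf(f, "%ld\n", requested);
		fclose(f);
	}

	/* ... and see how many the kernel actually managed to allocate */
	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (f && fscanf(f, "%ld", &got) == 1)
		printf("Allocated %ld%% of RAM as huge pages\n",
		       (100 * got) / requested);
	if (f)
		fclose(f);
	return 0;
}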

The success figures were comparable with or without direct reclaim. Unlike
V2, the success rates on PPC64 actually improved, but they are not reported
here.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim, and it increases the stall time of a lumpy
reclaimer by quite a margin. Compare the "Time stalled direct reclaim"
between the vanilla and nodirect kernels - the nodirect kernel is stalled
for less time, but not dramatically less, as direct reclaimers still stall
when dirty pages are encountered. Interestingly, the time kswapd is stalled
is significantly reduced and the test completes faster.

Memory Pressure leading to OOM
==============================

There were concerns that direct reclaim not writing back pages under heavy
memory pressure could lead to premature OOMs. To test this theory, a test
was run as follows:

1. Remove existing swap, create a swapfile on XFS and activate it
2. Start 1 kernel compile on XFS per CPU in the system, wait until they start
3. Start 2xNR_CPUs processes writing zero-filled files to XFS. The total size of the
   files was the amount of physical memory on the system
4. Start 2xNR_CPUs processes that map anonymous memory and continually dirty it. The
   total size of the mappings was the amount of physical memory in the system

This stress test was allowed to run for a period of time during which load, swap and IO
activity were all high. No premature OOMs were found.
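
The test processes are not included in this posting but, for illustration,
the anonymous-memory dirtier in step 4 amounts to something like this (the
per-process mapping size is an assumption):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	/* e.g. RAM / (2 * NR_CPUS) per process so the total is about RAM */
	size_t len = 256UL << 20;
	size_t off;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	for (;;)			/* continually redirty every page */
		for (off = 0; off < len; off += 4096)
			buf[off] = (char)off;
}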

FTrace Reclaim Statistics
                             traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                              14006       6421      17068 
Direct reclaim pages scanned               1751731     795431    1252629 
Direct reclaim write async I/O               58918      51257          0 
Direct reclaim write sync I/O                   38         27          0 
Wake kswapd requests                       1172938     632220    1082273 
Kswapd wakeups                                 313        260        299 
Kswapd pages scanned                       3824966    3273876    3754542 
Kswapd reclaim write async I/O              860177     460838     565028 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)            377.09     267.92     180.24 
Time kswapd awake (ms)                      295.44     264.65     243.42 

User/Sys Time Running Test (seconds)       8640.01   8668.76   8693.93
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               4185.68   4184.70   4088.38
Percentage Time kswapd Awake                 0.01%     0.00%     0.00%

Interestingly, lumpy reclaim writing pages synchronously was a factor in
this test, which I didn't expect - probably due to stack allocations for
new processes. Dirty pages are still being encountered by kswapd but, as a
percentage of scanning, this is reduced by the patches, as are the amount
of time stalled in direct reclaim and the time kswapd is awake. The main
thing is that OOMs were not unexpectedly triggered.

Overall, this series appears to improve things from an IO perspective - at
least in terms of the amount of IO being generated by the VM and the time
spent handling it. I do have some concerns about the variability in the
ratio of dirty to scanned pages encountered but, with the tracepoints, it
is something that can be investigated further. Even if we get that ratio
down, it is still the case that direct reclaim should not write pages, to
avoid a stack overflow. If writing back pages is found to be a requirement
for whatever reason, nothing in this series prevents a future patch doing
direct writeback on a separate stack, but with this approach more IO is
punted to the flusher threads, which should be desirable to the FS folk.

Comments?

 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 fs/btrfs/disk-io.c                                 |   21 +-
 fs/xfs/linux-2.6/xfs_aops.c                        |   15 -
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  560 ++++++++++--------
 mm/vmstat.c                                        |    2 -
 12 files changed, 1190 insertions(+), 374 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h



* [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: Nick Piggin <npiggin@suse.de>

Use lock_page_nosync in handle_write_error as after writepage we have no
reference to the mapping when taking the page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..62a30fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
-- 
1.7.1



* [PATCH 02/14] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds two trace events for kswapd waking up and going to sleep,
for the purposes of tracking kswapd activity, and two trace events for
direct reclaim beginning and ending. The information can be used to work
out how much time a process or the system is spending on the reclamation
of pages and, in the case of direct reclaim, how many pages were reclaimed
for that process. High-frequency triggering of these events could point to
memory pressure problems.
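
For illustration only (the Perl post-processing script added later in the
series does this properly), a consumer of the begin/end pair could total up
per-task time in direct reclaim along these lines. It assumes the default
ftrace text layout of "comm-pid  [cpu]  timestamp:  event:  args" and reads
the captured trace on stdin:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct task_stall {
	char id[64];		/* "comm-pid" from the trace line */
	double begin;		/* timestamp of an unmatched begin event, or -1 */
	double total;		/* accumulated seconds in direct reclaim */
};

static struct task_stall tasks[256];
static int nr_tasks;

static struct task_stall *lookup(const char *id)
{
	int i;

	for (i = 0; i < nr_tasks; i++)
		if (!strcmp(tasks[i].id, id))
			return &tasks[i];
	if (nr_tasks == 256)
		return NULL;
	strncpy(tasks[nr_tasks].id, id, sizeof(tasks[0].id) - 1);
	tasks[nr_tasks].begin = -1;
	return &tasks[nr_tasks++];
}

int main(void)
{
	char line[512], id[64], cpu[16], ts[32], event[64];
	int i;

	while (fgets(line, sizeof(line), stdin)) {
		struct task_stall *t;
		double now;

		if (line[0] == '#')
			continue;
		if (sscanf(line, "%63s %15s %31s %63s", id, cpu, ts, event) != 4)
			continue;
		t = lookup(id);
		if (!t)
			continue;
		now = strtod(ts, NULL);		/* the trailing ':' is ignored */
		if (!strcmp(event, "mm_vmscan_direct_reclaim_begin:"))
			t->begin = now;
		else if (!strcmp(event, "mm_vmscan_direct_reclaim_end:") && t->begin >= 0) {
			t->total += now - t->begin;
			t->begin = -1;
		}
	}
	for (i = 0; i < nr_tasks; i++)
		printf("%-20s %10.6f seconds in direct reclaim\n",
		       tasks[i].id, tasks[i].total);
	return 0;
}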

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/gfpflags.h |   37 +++++++++++++
 include/trace/events/kmem.h     |   38 +-------------
 include/trace/events/vmscan.h   |  115 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                     |   24 +++++++--
 4 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f76521f
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,115 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62a30fe..d425cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -1886,6 +1889,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1898,7 +1902,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2297,9 +2309,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2319,8 +2332,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
-- 
1.7.1


 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 03/14] tracing, vmscan: Add trace events for LRU page isolation
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds an event that fires when pages are isolated en masse from
the LRU lists. The event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.
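
As an illustration of how the event can be consumed, the sketch below (not
part of the patch itself) tallies the contig_failed counter per order from
the text trace, keying off the field names in the TP_printk format added
here. It assumes the raw trace text is fed on stdin, for example from
trace_pipe:

  #!/usr/bin/perl
  # Minimal sketch, not part of this patch: tally contig_failed per order
  # from mm_vmscan_lru_isolate lines read on stdin (e.g. from trace_pipe)
  use strict;

  my %failed_per_order;
  while (my $line = <STDIN>) {
  	next unless $line =~ /mm_vmscan_lru_isolate:/;
  	my ($order)  = $line =~ /order=([0-9]+)/;
  	my ($failed) = $line =~ /contig_failed=([0-9]+)/;
  	next unless defined $order && defined $failed;
  	$failed_per_order{$order} += $failed;
  }
  foreach my $order (sort { $a <=> $b } keys %failed_per_order) {
  	print "order=$order contig_failed=$failed_per_order{$order}\n";
  }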

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/vmscan.h |   46 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   14 ++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
 	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+		
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d425cef..095c66c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -994,12 +995,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 04/14] tracing, vmscan: Add trace event when a page is written
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a trace event for when page reclaim queues a page for IO,
recording whether the IO is synchronous or asynchronous. Excessive
synchronous IO for a process can result in noticeable stalls during direct
reclaim. Excessive IO from page reclaim may indicate that the system is
seriously under-provisioned for the number of dirty pages that exist.
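
As a quick illustration, and not part of the patch itself, the split
between synchronous and asynchronous reclaim writeback can be pulled
straight out of the text trace by matching the sync_io field in the print
format added here:

  #!/usr/bin/perl
  # Minimal sketch, not part of this patch: count how often reclaim queued
  # pages for sync vs async writeback, from mm_vmscan_writepage lines on stdin
  use strict;

  my ($sync, $async) = (0, 0);
  while (my $line = <STDIN>) {
  	next unless $line =~ /mm_vmscan_writepage:.*sync_io=([0-9]+)/;
  	$1 ? $sync++ : $async++;
  }
  print "reclaim writepage: sync=$sync async=$async\n";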

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/vmscan.h |   23 +++++++++++++++++++++++
 mm/vmscan.c                   |    2 ++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 		__entry->nr_lumpy_dirty,
 		__entry->nr_lumpy_failed)
 );
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
 		
 #endif /* _TRACE_VMSCAN_H */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 095c66c..20160c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events. It can be used to give an indication of how much traffic
there is on the LRU lists and how severe the latencies due to reclaim are.
Example output looks like the following:

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
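
To collect input for the script, a hypothetical wrapper along the
following lines can be used; the debugfs paths and the --ignore-pid switch
are the ones the script itself assumes, so adjust to taste:

  #!/usr/bin/perl
  # Hypothetical helper, not part of this patch: enable the vmscan trace
  # events and stream the live trace into the post-processing script
  use strict;

  my $tracing = "/sys/kernel/debug/tracing";
  open(my $enable, '>', "$tracing/events/vmscan/enable")
  	or die "Cannot enable vmscan events (is debugfs mounted?): $!";
  print $enable "1\n";
  close($enable);

  # Aggregate processes of the same name and parse the live trace
  exec("./trace-vmscan-postprocess.pl --ignore-pid < $tracing/trace_pipe");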

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a proof-of-concept for reading the text representation of trace output
+# related to page reclaim. It attempts to extract some high-level information on
+# what is going on. The accuracy of the parser may vary.
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert a sec.usec timestamp to milliseconds
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate/o\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
@ 2010-06-29 11:34   ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events.  It can be used to give an indication of how much traffic
there is on the LRU lists and how severe latencies due to reclaim are.
Example output looks like the following

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
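
Each latency entry above reads as order-latency_in_ms, so "9-200.179" is
an order-9 direct reclaim that stalled for roughly 200ms. The report is
built purely from the text ftrace emits on trace_pipe (plus
/proc/<pid>/stat when --read-procstat is used), with the vmscan
tracepoints enabled as the usage comment in the script shows. A minimal
sketch of the first parsing step, using the same field-extraction regex
as the script; the sample line is invented for illustration rather than
copied from a real trace:

	#!/usr/bin/perl
	# Hypothetical ftrace line: comm-pid [cpu] timestamp: tracepoint: details
	my $line = "kswapd0-311 [001] 280.999000: mm_vmscan_kswapd_wake: nid=0 order=9";
	if ($line =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/) {
		my ($comm_pid, $cpu, $timestamp, $tracepoint, $details) = ($1, $2, $3, $4, $5);
		print "$comm_pid $tracepoint $details\n";
	}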

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate/o\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 06/14] vmscan: kill prev_priority completely
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Since 2.6.28, zone->prev_priority has been unused and can be removed
safely. Doing so also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more)
attempts have not produced good performance numbers, so I have given up
on that approach.

The rest of this changelog consists of notes on prev_priority: why it
existed in the first place and why it may no longer be necessary. This
information is based heavily on discussions between Andrew Morton, Rik
van Riel and KOSAKI Motohiro, who is quoted heavily below.

Historically, prev_priority was important because it determined when the
VM would start unmapping PTE pages, i.e. it fed into the balances of note
within the VM: Anon vs File and Mapped vs Unmapped. Without prev_priority,
there is a potential risk of unnecessarily increasing minor faults, as a
large amount of read activity on use-once pages could push mapped pages
to the end of the LRU where they get unmapped.

There is no proof that this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it were still a problem, prev_priority would not work nowadays.
First of all, current vmscan still contains a lot of UP-centric code,
which exposes weaknesses on machines with dozens of CPUs. I think we need
more and more improvement there.

The problem is that current vmscan mixes up per-system pressure,
per-zone pressure and per-task pressure a bit. For example, prev_priority
tries to boost the priority of other concurrent reclaimers, but if
another task has a mempolicy restriction this is unnecessary, and it also
causes wrongly large latencies and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal and 2) we need
per-zone pressure, not per-system pressure.

Another example: DEF_PRIORITY is currently 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the
OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim
at the same time, the system is under higher memory pressure than
priority==0 implies (1/4096 * 10,000 > 2). prev_priority cannot solve
such a multithreaded workload issue. In other words, the prev_priority
concept assumes the system does not have lots of threads.
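
As a quick sanity check of that arithmetic (a standalone sketch, not code
from this patch): one full priority sweep scans 1/4096 + 1/2048 + ... + 1
of the LRU, just under two full cycles, while 10,000 concurrent reclaimers
each scanning 1/4096 of the LRU at DEF_PRIORITY add up to more than that.

	#!/usr/bin/perl
	# LRU fraction scanned over one full priority sweep, DEF_PRIORITY == 12
	my $one_sweep = 0;
	$one_sweep += 1.0 / (1 << $_) for (0 .. 12);
	# 10,000 concurrent reclaimers, each scanning 1/4096 of the LRU
	my $concurrent = 10_000 / 4096;
	printf "one sweep ~ %.4f LRU cycles, 10000 threads at DEF_PRIORITY ~ %.4f\n",
		$one_sweep, $concurrent;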

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/memcontrol.h |    5 ----
 include/linux/mmzone.h     |   15 -----------
 mm/memcontrol.c            |   31 ------------------------
 mm/page_alloc.c            |    2 -
 mm/vmscan.c                |   57 --------------------------------------------
 mm/vmstat.c                |    2 -
 6 files changed, 0 insertions(+), 112 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ece0a..7557f66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20160c7..f3d95c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 06/14] vmscan: kill prev_priority completely
@ 2010-06-29 11:34   ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Since 2.6.28, zone->prev_priority has been unused and can be removed
safely. Doing so also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more)
attempts have not produced good performance numbers, so I have given up
on that approach.

The rest of this changelog consists of notes on prev_priority: why it
existed in the first place and why it may no longer be necessary. This
information is based heavily on discussions between Andrew Morton, Rik
van Riel and KOSAKI Motohiro, who is quoted heavily below.

Historically, prev_priority was important because it determined when the
VM would start unmapping PTE pages, i.e. it fed into the balances of note
within the VM: Anon vs File and Mapped vs Unmapped. Without prev_priority,
there is a potential risk of unnecessarily increasing minor faults, as a
large amount of read activity on use-once pages could push mapped pages
to the end of the LRU where they get unmapped.

There is no proof that this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it were still a problem, prev_priority would not work nowadays.
First of all, current vmscan still contains a lot of UP-centric code,
which exposes weaknesses on machines with dozens of CPUs. I think we need
more and more improvement there.

The problem is that current vmscan mixes up per-system pressure,
per-zone pressure and per-task pressure a bit. For example, prev_priority
tries to boost the priority of other concurrent reclaimers, but if
another task has a mempolicy restriction this is unnecessary, and it also
causes wrongly large latencies and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal and 2) we need
per-zone pressure, not per-system pressure.

Another example: DEF_PRIORITY is currently 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the
OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim
at the same time, the system is under higher memory pressure than
priority==0 implies (1/4096 * 10,000 > 2). prev_priority cannot solve
such a multithreaded workload issue. In other words, the prev_priority
concept assumes the system does not have lots of threads.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/memcontrol.h |    5 ----
 include/linux/mmzone.h     |   15 -----------
 mm/memcontrol.c            |   31 ------------------------
 mm/page_alloc.c            |    2 -
 mm/vmscan.c                |   57 --------------------------------------------
 mm/vmstat.c                |    2 -
 6 files changed, 0 insertions(+), 112 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ece0a..7557f66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20160c7..f3d95c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 07/14] vmscan: simplify shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Now that max_scan of shrink_inactive_list() is always passed a value no
larger than SWAP_CLUSTER_MAX, the loop that scans pages in batches can be
removed. This also helps reduce stack usage; a simplified sketch of the
resulting flow follows the list below.

In detail:
 - remove the "while (nr_scanned < max_scan)" loop
 - remove nr_freed (nr_reclaimed is used directly now)
 - remove nr_scan (nr_scanned is used directly now)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead of
   using SWAP_CLUSTER_MAX
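
For illustration only, here is a minimal userspace sketch of the simplified
flow; it is not kernel code, and isolate_batch()/reclaim_batch() are
hypothetical stand-ins for isolate_pages() and shrink_page_list():

    #include <stddef.h>

    #define BATCH_MAX 32                     /* plays the role of SWAP_CLUSTER_MAX */

    static size_t isolate_batch(size_t want, size_t *scanned)
    {
            size_t taken = want < BATCH_MAX ? want : BATCH_MAX;

            *scanned = taken;                /* pretend every candidate was isolated */
            return taken;
    }

    static size_t reclaim_batch(size_t taken)
    {
            return taken;                    /* pretend every isolated item was reclaimed */
    }

    /* One batch per call: no "while (nr_scanned < max_scan)" loop any more. */
    static size_t shrink_batch(size_t nr_to_scan)
    {
            size_t nr_scanned;
            size_t nr_taken = isolate_batch(nr_to_scan, &nr_scanned);

            if (nr_taken == 0)
                    return 0;

            return reclaim_batch(nr_taken);  /* nr_reclaimed used directly */
    }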

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |  211 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3d95c6..d964cfa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
+		/*
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
+		 */
+	}
 
-		if (nr_taken == 0)
-			goto done;
+	if (nr_taken == 0)
+		goto done;
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	nr_active = clear_active_flags(&page_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+					-count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+					-count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+					-count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+					-count[LRU_INACTIVE_ANON]);
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+	reclaim_stat->recent_scanned[0] += nr_anon;
+	reclaim_stat->recent_scanned[1] += nr_file;
 
-		spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() && sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
-
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+		nr_active = clear_active_flags(&page_list, count);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += nr_freed;
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-		spin_lock(&zone->lru_lock);
-		/*
-		 * Put back any unfreeable pages.
-		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
+	spin_lock(&zone->lru_lock);
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	while (!list_empty(&page_list)) {
+		int lru;
+		page = lru_to_page(&page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
 		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-  	} while (nr_scanned < max_scan);
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
 done:
 	spin_unlock_irq(&zone->lru_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 08/14] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

Remove a temporary variable that is used only once and does not help
clarify the code.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d964cfa..509d093 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 09/14] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

shrink_inactive_list() sets up a pagevec to release unfreeable pages, which
consumes a significant amount of stack. This patch splits
shrink_inactive_list() so that the stack usage is moved out of the main path
and the code that calls writepage() no longer carries an unused pagevec on
its stack.
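
The underlying trick is generic: if a large object is only needed for the
cleanup phase, moving that phase into a function the compiler cannot inline
keeps the object out of the frame of the deep call path. A standalone sketch
of the pattern, not kernel code (the kernel spells the attribute
noinline_for_stack):

    #include <stddef.h>

    #define BATCH 14                        /* roughly a pagevec's worth of slots */

    /* The large batch buffer lives only in this helper's frame. */
    static __attribute__((noinline)) void putback_leftovers(void **items, size_t nr)
    {
            void *buf[BATCH];
            size_t pending = 0;
            size_t i;

            for (i = 0; i < nr; i++) {
                    buf[pending++] = items[i];
                    if (pending == BATCH)
                            pending = 0;    /* a real version would release the batch here */
            }
            /* ... and release any remainder here */
    }

    /* Deep path: its frame no longer contains the batch buffer at all. */
    static size_t reclaim_items(void **items, size_t nr)
    {
            size_t reclaimed = nr / 2;      /* pretend half were reclaimed */

            putback_leftovers(items + reclaimed, nr - reclaimed);
            return reclaimed;
    }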

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   99 +++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 56 insertions(+), 43 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 509d093..8b4ed48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
+{
+	struct page *page;
+	struct pagevec pvec;
+
+	pagevec_init(&pvec, 1);
+
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
-	struct pagevec pvec;
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	struct page *page;
 	unsigned long nr_taken;
 	unsigned long nr_active;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
@@ -1157,8 +1203,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	}
 
 
-	pagevec_init(&pvec, 1);
-
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1186,8 +1230,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		 */
 	}
 
-	if (nr_taken == 0)
-		goto done;
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
 	nr_active = clear_active_flags(&page_list, count);
 	__count_vm_events(PGDEACTIVATE, nr_active);
@@ -1237,40 +1283,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	spin_lock(&zone->lru_lock);
-	/*
-	 * Put back any unfreeable pages.
-	 */
-	while (!list_empty(&page_list)) {
-		int lru;
-		page = lru_to_page(&page_list);
-		VM_BUG_ON(PageLRU(page));
-		list_del(&page->lru);
-		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
-			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
-			continue;
-		}
-		SetPageLRU(page);
-		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
-		if (is_active_lru(lru)) {
-			int file = is_file_lru(lru);
-			reclaim_stat->recent_rotated[file]++;
-		}
-		if (!pagevec_add(&pvec, page)) {
-			spin_unlock_irq(&zone->lru_lock);
-			__pagevec_release(&pvec);
-			spin_lock_irq(&zone->lru_lock);
-		}
-	}
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
+	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 10/14] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

shrink_page_list() sets up a pagevec to release pages as they are freed,
and that pagevec consumes a significant amount of stack. This patch instead
collects the pages to be freed on a linked list which is then freed en masse
at the end. This avoids using stack in the main path that potentially calls
writepage().
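
The same idea in a standalone C sketch, not the kernel API: deferred frees
are threaded onto a list head that costs one pointer of stack, and the
batched release happens once, after the loop that may go deep:

    #include <stdlib.h>

    struct node {
            struct node *next;
            int keep;
    };

    /* Batched release, kept out of the hot path. */
    static void free_deferred(struct node *head)
    {
            while (head) {
                    struct node *next = head->next;

                    free(head);
                    head = next;
            }
    }

    /* The main loop carries only a single pointer for the pending frees. */
    static void process(struct node **items, int n)
    {
            struct node *free_head = NULL;
            int i;

            for (i = 0; i < n; i++) {
                    if (items[i]->keep)
                            continue;               /* this one would be put back */
                    items[i]->next = free_head;     /* defer the free */
                    free_head = items[i];
                    items[i] = NULL;
            }

            free_deferred(free_head);               /* release en masse */
    }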

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   36 ++++++++++++++++++++++++++++--------
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b4ed48..1107830 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -628,13 +646,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -809,10 +826,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -835,9 +854,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 11/14] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

When shrink_inactive_list() isolates pages, it gathers a number of counters
in temporary variables. These consume stack in the main path that calls
->writepage(). This patch moves the accounting updates out of the main path
to reduce stack usage.
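
As a generic sketch with hypothetical names: the per-class scratch counters
live only in the helper's frame (the kernel additionally marks such helpers
noinline_for_stack), and the caller receives just the two totals it still
needs through out-parameters:

    enum item_class { CLASS_A, CLASS_B, CLASS_C, CLASS_D, NR_CLASSES };

    static void count_classes(const enum item_class *classes, int n,
                              unsigned long *nr_ab, unsigned long *nr_cd)
    {
            unsigned int count[NR_CLASSES] = { 0, };   /* scratch space stays here */
            int i;

            for (i = 0; i < n; i++)
                    count[classes[i]]++;

            *nr_ab = count[CLASS_A] + count[CLASS_B];
            *nr_cd = count[CLASS_C] + count[CLASS_D];
    }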

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   63 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1107830..efa6ee4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1072,7 +1072,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1152,12 +1153,13 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
-putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
 				struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1196,6 +1198,37 @@ putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
 	pagevec_release(&pvec);
 }
 
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
+
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
+
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
+
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1207,10 +1240,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_taken;
 	unsigned long nr_active;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1255,25 +1286,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		return 0;
 	}
 
-	nr_active = clear_active_flags(&page_list, count);
-	__count_vm_events(PGDEACTIVATE, nr_active);
-
-	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-					-count[LRU_ACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-					-count[LRU_INACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-					-count[LRU_ACTIVE_ANON]);
-	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-					-count[LRU_INACTIVE_ANON]);
-
-	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
-
-	reclaim_stat->recent_scanned[0] += nr_anon;
-	reclaim_stat->recent_scanned[1] += nr_file;
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
 	spin_unlock_irq(&zone->lru_lock);
 
@@ -1292,7 +1305,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
 		 */
-		nr_active = clear_active_flags(&page_list, count);
+		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
 		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
@@ -1303,7 +1316,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
in direct lumpy reclaim and dirty pages are encountered, the process will
stall for the background flusher before trying to reclaim the pages again.

Memory control groups do not have a kswapd-like thread, nor do their pages
get directly reclaimed from the page allocator. Instead, memory control group
pages are reclaimed when the quota is being exceeded or the group is being
shrunk. As the entry points into page reclaim are not expected to be deep
call chains, memcg is still allowed to write back dirty pages.
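
In condensed form, the logic added below behaves roughly as follows (a
simplified sketch of the hunks in this patch, with counters and error
handling omitted):

	/* in shrink_page_list(): kswapd and memcg may still write back,
	 * direct reclaim defers the page instead */
	if (PageDirty(page)) {
		if (!reclaim_can_writeback(sc) &&
				dirty_isolated < MAX_SWAP_CLEAN_WAIT) {
			list_add(&page->lru, &dirty_pages);
			unlock_page(page);
			nr_dirty++;
			goto keep_dirty;
		}
		/* otherwise fall through to pageout() as before */
	}

	/* after the scan loop */
	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
		/* ask flusher threads to clean them, then throttle */
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* lumpy reclaim needs these specific pages, so retry */
		if (sync_writeback == PAGEOUT_IO_SYNC) {
			dirty_isolated++;
			list_splice(&dirty_pages, page_list);
			INIT_LIST_HEAD(&dirty_pages);
			goto restart_dirty;
		}
	}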

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |  158 ++++++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 108 insertions(+), 50 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efa6ee4..d5a2e74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,56 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a range of pages fails.
+	 */
+	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		wait_on_page_writeback(page);
+
+	if (!PageWriteback(page)) {
+		/* synchronous write or broken a_ops? */
+		ClearPageReclaim(page);
+	}
+	trace_mm_vmscan_writepage(page,
+		sync_writeback == PAGEOUT_IO_SYNC);
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+	return PAGE_SUCCESS;
+}
+
+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+	return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -367,45 +417,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
-	if (clear_page_dirty_for_io(page)) {
-		int res;
-		struct writeback_control wbc = {
-			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
-			.range_start = 0,
-			.range_end = LLONG_MAX,
-			.nonblocking = 1,
-			.for_reclaim = 1,
-		};
-
-		SetPageReclaim(page);
-		res = mapping->a_ops->writepage(page, &wbc);
-		if (res < 0)
-			handle_write_error(mapping, page, res);
-		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			ClearPageReclaim(page);
-			return PAGE_ACTIVATE;
-		}
-
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
-		if (!PageWriteback(page)) {
-			/* synchronous write or broken a_ops? */
-			ClearPageReclaim(page);
-		}
-		trace_mm_vmscan_writepage(page,
-			sync_writeback == PAGEOUT_IO_SYNC);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
-		return PAGE_SUCCESS;
-	}
-
-	return PAGE_CLEAN;
+	return write_reclaim_page(page, mapping, sync_writeback);
 }
 
 /*
@@ -638,6 +650,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 	pagevec_free(&freed_pvec);
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -645,13 +660,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
-	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
-	int pgactivate = 0;
+	LIST_HEAD(putback_pages);
+	LIST_HEAD(dirty_pages);
+	int pgactivate;
+	int dirty_isolated = 0;
+	unsigned long nr_dirty;
 	unsigned long nr_reclaimed = 0;
 
+	pgactivate = 0;
 	cond_resched();
 
+restart_dirty:
+	nr_dirty = 0;
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -740,7 +761,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (PageDirty(page)) {
+		if (PageDirty(page))  {
+			/*
+			 * If the caller cannot writeback pages, dirty pages are
+			 * put on a separate list for cleaning by either a flusher
+			 * thread or kswapd
+			 */
+			if (!reclaim_can_writeback(sc) &&
+					dirty_isolated < MAX_SWAP_CLEAN_WAIT) {
+				list_add(&page->lru, &dirty_pages);
+				unlock_page(page);
+				nr_dirty++;
+				goto keep_dirty;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -851,13 +885,38 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		list_add(&page->lru, &ret_pages);
+		list_add(&page->lru, &putback_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+		/*
+		 * Wakeup a flusher thread to clean at least as many dirty
+		 * pages as encountered by direct reclaim. Wait on congestion
+		 * to throttle processes cleaning dirty pages
+		 */
+		wakeup_flusher_threads(nr_dirty);
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		/*
+		 * As lumpy reclaim targets specific pages, wait on them
+		 * to be cleaned and try reclaim again for a time.
+		 */
+		if (sync_writeback == PAGEOUT_IO_SYNC) {
+			dirty_isolated++;
+			list_splice(&dirty_pages, page_list);
+			INIT_LIST_HEAD(&dirty_pages);
+			goto restart_dirty;
+		}
+	}
+
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	if (!list_empty(&dirty_pages))
+		list_splice(&dirty_pages, page_list);
+	list_splice(&putback_pages, page_list);
+
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1866,10 +1925,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
+		if (total_scanned > writeback_threshold)
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-			sc->may_writepage = 1;
-		}
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1907,7 +1964,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -1936,7 +1993,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -2588,7 +2645,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+					(zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow dirty pages in btrfs to be written
back from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/btrfs/disk-io.c |   21 +--------------------
 1 files changed, 1 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..e4aa547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -696,26 +696,7 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
 	int was_dirty;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	if (!(current->flags & PF_MEMALLOC)) {
-		return extent_write_full_page(tree, page,
-					      btree_get_extent, wbc);
-	}
-
-	redirty_page_for_writepage(wbc, page);
-	eb = btrfs_find_tree_block(root, page_offset(page),
-				      PAGE_CACHE_SIZE);
-	WARN_ON(!eb);
-
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
-	if (!was_dirty) {
-		spin_lock(&root->fs_info->delalloc_lock);
-		root->fs_info->dirty_metadata_bytes += PAGE_CACHE_SIZE;
-		spin_unlock(&root->fs_info->delalloc_lock);
-	}
-	free_extent_buffer(eb);
-
-	unlock_page(page);
-	return 0;
+	return extent_write_full_page(tree, page, btree_get_extent, wbc);
 }
 
 static int btree_writepages(struct address_space *mapping,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow dirty pages in xfs to be written
back from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/xfs/linux-2.6/xfs_aops.c |   15 ---------------
 1 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 34640d6..4c89db3 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1333,21 +1333,6 @@ xfs_vm_writepage(
 	trace_xfs_writepage(inode, page, 0);
 
 	/*
-	 * Refuse to write the page out if we are called from reclaim context.
-	 *
-	 * This is primarily to avoid stack overflows when called from deep
-	 * used stacks in random callers for direct reclaim, but disabling
-	 * reclaim for kswap is a nice side-effect as kswapd causes rather
-	 * suboptimal I/O patters, too.
-	 *
-	 * This should really be done by the core VM, but until that happens
-	 * filesystems like XFS, btrfs and ext4 have to take care of this
-	 * by themselves.
-	 */
-	if (current->flags & PF_MEMALLOC)
-		goto out_fail;
-
-	/*
 	 * We need a transaction if:
 	 *  1. There are delalloc buffers on the page
 	 *  2. The page is uptodate and we have unmapped buffers
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 12:37     ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-06-29 12:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

I don't see a patch in this set which refuses writeback from the memcg
context, which we identified as having a large stack footprint in the
discussion of the last patch set.

Meanwhile I've submitted a patch to xfs to allow reclaim from kswapd,
and just prevent it from direct and memcg reclaim.

Btw, it might be worth also allowing kswapd to do writeout on ext4,
but doing that will be a bit more complicated than the btrfs and xfs
variants as the code is rather convoluted.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 12:37     ` Christoph Hellwig
@ 2010-06-29 12:51       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 12:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> I don't see a patch in this set which refuses writeback from the memcg
> context, which we identified as having large stack footprint in hte
> discussion of the last patch set.
> 

It wasn't clear to me what the right approach was there and I should
have noted that in the intro. The last note I have on it is this message
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
pages for memcg, I left memcg being able to direct writeback to see what
the memcg people preferred.

> Meanwhile I've submitted a patch to xfs to allow reclaim from kswapd,
> and just prevent it from direct and memcg reclaim.
> 

Good stuff.

> Btw, it might be worth to also allow kswap to all writeout on ext4,
> but doing that will be a bit more complicated than the btrfs and xfs
> variants as the code is rather convoluted.
> 

Fully agreed. I looked into it and got caught in its twisty web so
postponed it until this much can be finalised, agreed upon or rejected -
all prerequisites to making the ext4 work worthwhile.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 14:27     ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-06-29 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 8:34 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Use lock_page_nosync in handle_write_error as after writepage we have no
> reference to the mapping when taking the page lock.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Trivial.
Please modify the description of the function if there is a next revision:
"run sleeping lock_page()" -> "run sleeping lock_page_nosync()"


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 14:44     ` Johannes Weiner
  -1 siblings, 0 replies; 105+ messages in thread
From: Johannes Weiner @ 2010-06-29 14:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Tue, Jun 29, 2010 at 12:34:35PM +0100, Mel Gorman wrote:
> From: Nick Piggin <npiggin@suse.de>
> 
> Use lock_page_nosync in handle_write_error as after writepage we have no
> reference to the mapping when taking the page lock.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 12:51       ` Mel Gorman
@ 2010-06-30  0:14         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-30  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, 29 Jun 2010 13:51:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > I don't see a patch in this set which refuses writeback from the memcg
> > context, which we identified as having large stack footprint in hte
> > discussion of the last patch set.
> > 
> 
> It wasn't clear to me what the right approach was there and should
> have noted that in the intro. The last note I have on it is this message
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> pages for memcg, I left memcg being able to direct writeback to see what
> the memcg people preferred.
> 

Hmm. If some filesystems don't support direct ->writeback, memcg shouldn't
depend on it. If so, memcg should depend on some writeback thread (as kswapd
does). OK.

Then, my concern here is which kswapd we should wake up and how it can stop.
IOW, how kswapd can know that a memcg has some remaining writeback and is
stuck on it.

Here is one idea (this patch will not work... not tested at all).
If we can have a "victim page list" and kswapd can consult it to know which
pages should be written, kswapd can know when it should work.

CPU usage by memcg will be a new problem... but...

==
Add a new LRU, "CLEANING", and make kswapd launder it.
This patch also changes PG_reclaim behavior. The new PG_reclaim works
as follows:
 - If PG_reclaim is set, the page is on the CLEANING list.

And when kswapd launders a page:
 - issue a writeback. (I'm thinking about whether I should put this
   cleaned page back on the CLEANING lru and free it later.)
 - if it can be freed directly, free it.
This just uses the current shrink_list().

Maybe this patch itself includes many bad points...
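
Roughly, the intended flow is the following (a sketch only, condensed from
the untested patch below; the helper names are the ones the patch
introduces):

	/* pageout(): direct reclaim no longer writes, it tags the page
	 * so that it lands on the new LRU_CLEANING list */
	if (!current_is_kswapd()) {
		SetPageReclaim(page);
		return PAGE_KEEP;
	}

	/* kswapd loop: launder any node that has CLEANING pages */
	if (need_to_cleaning_node(pgdat))
		launder_pgdat(pgdat);	/* shrink_cleaning_list() per zone */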

---
 fs/proc/meminfo.c         |    2 
 include/linux/mm_inline.h |    9 ++
 include/linux/mmzone.h    |    7 ++
 mm/filemap.c              |    3 
 mm/internal.h             |    1 
 mm/page-writeback.c       |    1 
 mm/page_io.c              |    1 
 mm/swap.c                 |   31 ++-------
 mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
 9 files changed, 176 insertions(+), 32 deletions(-)

Index: mmotm-0611/include/linux/mmzone.h
===================================================================
--- mmotm-0611.orig/include/linux/mmzone.h
+++ mmotm-0611/include/linux/mmzone.h
@@ -85,6 +85,7 @@ enum zone_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_CLEANING,		/*  "     "     "   "       "         */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -133,6 +134,7 @@ enum lru_list {
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
+	LRU_CLEANING,
 	NR_LRU_LISTS
 };
 
@@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
 	return (l == LRU_UNEVICTABLE);
 }
 
+static inline int is_cleaning_lru(enum lru_list l)
+{
+	return (l == LRU_CLEANING);
+}
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
Index: mmotm-0611/include/linux/mm_inline.h
===================================================================
--- mmotm-0611.orig/include/linux/mm_inline.h
+++ mmotm-0611/include/linux/mm_inline.h
@@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
 	enum lru_list l;
 
 	list_del(&page->lru);
-	if (PageUnevictable(page)) {
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		l = LRU_CLEANING;
+	} else if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
 	} else {
@@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
 {
 	enum lru_list lru;
 
-	if (PageUnevictable(page))
+	if (PageReclaim(page)) {
+		lru = LRU_CLEANING;
+	} else if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
 	else {
 		lru = page_lru_base_type(page);
Index: mmotm-0611/fs/proc/meminfo.c
===================================================================
--- mmotm-0611.orig/fs/proc/meminfo.c
+++ mmotm-0611/fs/proc/meminfo.c
@@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
 		"Unevictable:    %8lu kB\n"
+		"Cleaning:       %8lu kB\n"
 		"Mlocked:        %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
 		K(pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_UNEVICTABLE]),
+		K(pages[LRU_CLEANING]),
 		K(global_page_state(NR_MLOCK)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: mmotm-0611/mm/swap.c
===================================================================
--- mmotm-0611.orig/mm/swap.c
+++ mmotm-0611/mm/swap.c
@@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+		if (PageLRU(page)) {
+			int lru = page_lru(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
 			pgmoved++;
 		}
@@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
 	pagevec_reinit(pvec);
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void  rotate_reclaimable_page(struct page *page)
-{
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
-	    !PageUnevictable(page) && PageLRU(page)) {
-		struct pagevec *pvec;
-		unsigned long flags;
-
-		page_cache_get(page);
-		local_irq_save(flags);
-		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
-			pagevec_move_tail(pvec);
-		local_irq_restore(flags);
-	}
-}
-
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
@@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
 {
 	if (PageActive(page)) {
 		VM_BUG_ON(PageUnevictable(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageActive(page);
 	} else if (PageUnevictable(page)) {
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageUnevictable(page);
+	} else if (PageReclaim(page)) {
+		VM_BUG_ON(PageReclaim(page));
+		VM_BUG_ON(PageUnevictable(page));
+		ClearPageReclaim(page);
 	}
 
 	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
Index: mmotm-0611/mm/filemap.c
===================================================================
--- mmotm-0611.orig/mm/filemap.c
+++ mmotm-0611/mm/filemap.c
@@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (TestClearPageReclaim(page))
-		rotate_reclaimable_page(page);
-
 	if (!test_clear_page_writeback(page))
 		BUG();
 
Index: mmotm-0611/mm/internal.h
===================================================================
--- mmotm-0611.orig/mm/internal.h
+++ mmotm-0611/mm/internal.h
@@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
Index: mmotm-0611/mm/page-writeback.c
===================================================================
--- mmotm-0611.orig/mm/page-writeback.c
+++ mmotm-0611/mm/page-writeback.c
@@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
Index: mmotm-0611/mm/page_io.c
===================================================================
--- mmotm-0611.orig/mm/page_io.c
+++ mmotm-0611/mm/page_io.c
@@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
 				imajor(bio->bi_bdev->bd_inode),
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
-		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
 	bio_put(bio);
Index: mmotm-0611/mm/vmscan.c
===================================================================
--- mmotm-0611.orig/mm/vmscan.c
+++ mmotm-0611/mm/vmscan.c
@@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
+	if (!current_is_kswapd()) {
+		/* pass this page to kswapd. */
+		SetPageReclaim(page);
+		return PAGE_KEEP;
+	}
+
 	if (clear_page_dirty_for_io(page)) {
 		int res;
 		struct writeback_control wbc = {
@@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
 
 redo:
 	ClearPageUnevictable(page);
+	/* This function never puts pages to CLEANING queue */
+	ClearPageReclaim(page);
 
 	if (page_evictable(page, NULL)) {
 		/*
@@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
 		 * page release code relies on it.
 		 */
 		ClearPageLRU(page);
+		/* when someone isolate this page, clear reclaim status */
+		ClearPageReclaim(page);
 		ret = 0;
 	}
 
@@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
  * # of pages of each types and clearing any active bits.
  */
 static unsigned long count_page_types(struct list_head *page_list,
-				unsigned int *count, int clear_active)
+				unsigned int *count, int clear_actives)
 {
 	int nr_active = 0;
 	int lru;
@@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
 			int lru = page_lru(page);
 			ret = 0;
 			ClearPageLRU(page);
+			ClearPageReclaim(page);
 
 			del_page_from_lru_list(zone, page, lru);
 		}
@@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
 	return isolated > inactive;
 }
 
+/* only called by kswapd to do I/O and put back clean pages to its LRU */
+static void shrink_cleaning_list(struct zone *zone)
+{
+	LIST_HEAD(page_list);
+	struct list_head *src;
+	struct pagevec pvec;
+	unsigned long nr_pageout;
+	unsigned long nr_cleaned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = 1,
+		.mem_cgroup = NULL,
+	};
+
+	pagevec_init(&pvec, 1);
+	lru_add_drain();
+
+	src = &zone->lru[LRU_CLEANING].list;
+	nr_pageout = 0;
+	nr_cleaned = 0;
+	spin_lock_irq(&zone->lru_lock);
+	do {
+		unsigned int count[NR_LRU_LISTS] = {0,};
+		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
+		unsigned long nr_scan;
+
+		if (list_empty(src))
+			goto done;
+
+		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
+				zone_page_state(zone, NR_CLEANING)/8);
+		/* we do global-only */
+		nr_taken = isolate_lru_pages(check_clean,
+					src, &page_list, &nr_scan,
+					0, ISOLATE_BOTH, 0);
+		zone->pages_scanned += nr_scan;
+		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
+		if (nr_taken == 0)
+			goto done;
+		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
+		spin_unlock_irq(&zone->lru_lock);
+		/*
+		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
+		 * we can count correct value
+		 */
+		count_page_types(&page_list, count, 0);
+		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+
+		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
+		/*
+		 * Put back any unfreeable pages.
+		 */
+		while (!list_empty(&page_list)) {
+			int lru;
+			struct page *page;
+
+			page = lru_to_page(&page_list);
+			VM_BUG_ON(PageLRU(page));
+			list_del(&page->lru);
+			if (!unlikely(!page_evictable(page, NULL))) {
+				spin_unlock_irq(&zone->lru_lock);
+				putback_lru_page(page);
+				spin_lock_irq(&zone->lru_lock);
+				continue;
+			}
+			SetPageLRU(page);
+			lru = page_lru(page);
+			add_page_to_lru_list(zone, page, lru);
+			if (!pagevec_add(&pvec, page)) {
+				spin_unlock_irq(&zone->lru_lock);
+				__pagevec_release(&pvec);
+				spin_lock_irq(&zone->lru_lock);
+			}
+		}
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_pageout += nr_taken - nr_freed;
+		nr_cleaned += nr_freed;
+		if (nr_pageout > SWAP_CLUSTER_MAX) {
+			/* there are remaining I/Os */
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			nr_pageout /= 2;
+		}
+	} while(nr_cleaned < SWAP_CLUSTER_MAX);
+done:
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+	return;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
 					sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (current_is_kswapd())
+			shrink_cleaning_list(zone);
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2222,6 +2333,42 @@ out:
 	return sc.nr_reclaimed;
 }
 
+static void launder_pgdat(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+		shrink_cleaning_list(zone);
+	}
+}
+
+/*
+ * Find a zone which has cleaning list.
+ */
+static int need_to_cleaning_node(pg_data_t *pgdat)
+{
+	int i;
+	struct zone *zone;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+	}
+	return (i != MAX_NR_ZONES);
+}
+
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2275,7 +2422,9 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		if (need_to_cleaning_node(pgdat)) {
+			launder_pgdat(pgdat);
+		} else if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
 			 * allocation


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
@ 2010-06-30  0:14         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-30  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, 29 Jun 2010 13:51:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > I don't see a patch in this set which refuses writeback from the memcg
> > context, which we identified as having large stack footprint in hte
> > discussion of the last patch set.
> > 
> 
> It wasn't clear to me what the right approach was there and should
> have noted that in the intro. The last note I have on it is this message
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> pages for memcg, I left memcg being able to direct writeback to see what
> the memcg people preferred.
> 

Hmm. If some filesystems don't support direct ->writeback, memcg shouldn't
depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
ok.

Then, my concern here is that which kswapd we should wake up and how it can stop.
IOW, how kswapd can know a memcg has some remaining writeback and struck on it.

One idea is here. (this patch will not work...not tested at all.)
If we can have "victim page list" and kswapd can depend on it to know
"which pages should be written", kswapd can know when it should work.

cpu usage by memcg will be a new problem...but...

==
Add a new LRU "CLEANING" and make kswapd launder it.
This patch also changes PG_reclaim behavior. New PG_reclaim works
as
 - If PG_reclaim is set, a page is on CLEAINING LIST.

And when kswapd launder a page
 - issue an writeback. (I'm thinking whehter I should put this
   cleaned page back to CLEANING lru and free it later.) 
 - if it can free directly, free it.
This just use current shrink_list().

Maybe this patch itself inlcludes many bad point...

---
 fs/proc/meminfo.c         |    2 
 include/linux/mm_inline.h |    9 ++
 include/linux/mmzone.h    |    7 ++
 mm/filemap.c              |    3 
 mm/internal.h             |    1 
 mm/page-writeback.c       |    1 
 mm/page_io.c              |    1 
 mm/swap.c                 |   31 ++-------
 mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
 9 files changed, 176 insertions(+), 32 deletions(-)

Index: mmotm-0611/include/linux/mmzone.h
===================================================================
--- mmotm-0611.orig/include/linux/mmzone.h
+++ mmotm-0611/include/linux/mmzone.h
@@ -85,6 +85,7 @@ enum zone_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_CLEANING,		/*  "     "     "   "       "         */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -133,6 +134,7 @@ enum lru_list {
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
+	LRU_CLEANING,
 	NR_LRU_LISTS
 };
 
@@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
 	return (l == LRU_UNEVICTABLE);
 }
 
+static inline int is_cleaning_lru(enum lru_list l)
+{
+	return (l == LRU_CLEANING);
+}
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
Index: mmotm-0611/include/linux/mm_inline.h
===================================================================
--- mmotm-0611.orig/include/linux/mm_inline.h
+++ mmotm-0611/include/linux/mm_inline.h
@@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
 	enum lru_list l;
 
 	list_del(&page->lru);
-	if (PageUnevictable(page)) {
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		l = LRU_CLEANING;
+	} else if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
 	} else {
@@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
 {
 	enum lru_list lru;
 
-	if (PageUnevictable(page))
+	if (PageReclaim(page)) {
+		lru = LRU_CLEANING;
+	} else if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
 	else {
 		lru = page_lru_base_type(page);
Index: mmotm-0611/fs/proc/meminfo.c
===================================================================
--- mmotm-0611.orig/fs/proc/meminfo.c
+++ mmotm-0611/fs/proc/meminfo.c
@@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
 		"Unevictable:    %8lu kB\n"
+		"Cleaning:       %8lu kB\n"
 		"Mlocked:        %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
 		K(pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_UNEVICTABLE]),
+		K(pages[LRU_CLEANING]),
 		K(global_page_state(NR_MLOCK)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: mmotm-0611/mm/swap.c
===================================================================
--- mmotm-0611.orig/mm/swap.c
+++ mmotm-0611/mm/swap.c
@@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+		if (PageLRU(page)) {
+			int lru = page_lru(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
 			pgmoved++;
 		}
@@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
 	pagevec_reinit(pvec);
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void  rotate_reclaimable_page(struct page *page)
-{
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
-	    !PageUnevictable(page) && PageLRU(page)) {
-		struct pagevec *pvec;
-		unsigned long flags;
-
-		page_cache_get(page);
-		local_irq_save(flags);
-		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
-			pagevec_move_tail(pvec);
-		local_irq_restore(flags);
-	}
-}
-
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
@@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
 {
 	if (PageActive(page)) {
 		VM_BUG_ON(PageUnevictable(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageActive(page);
 	} else if (PageUnevictable(page)) {
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageUnevictable(page);
+	} else if (PageReclaim(page)) {
+		VM_BUG_ON(PageReclaim(page));
+		VM_BUG_ON(PageUnevictable(page));
+		ClearPageReclaim(page);
 	}
 
 	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
Index: mmotm-0611/mm/filemap.c
===================================================================
--- mmotm-0611.orig/mm/filemap.c
+++ mmotm-0611/mm/filemap.c
@@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (TestClearPageReclaim(page))
-		rotate_reclaimable_page(page);
-
 	if (!test_clear_page_writeback(page))
 		BUG();
 
Index: mmotm-0611/mm/internal.h
===================================================================
--- mmotm-0611.orig/mm/internal.h
+++ mmotm-0611/mm/internal.h
@@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
Index: mmotm-0611/mm/page-writeback.c
===================================================================
--- mmotm-0611.orig/mm/page-writeback.c
+++ mmotm-0611/mm/page-writeback.c
@@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
Index: mmotm-0611/mm/page_io.c
===================================================================
--- mmotm-0611.orig/mm/page_io.c
+++ mmotm-0611/mm/page_io.c
@@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
 				imajor(bio->bi_bdev->bd_inode),
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
-		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
 	bio_put(bio);
Index: mmotm-0611/mm/vmscan.c
===================================================================
--- mmotm-0611.orig/mm/vmscan.c
+++ mmotm-0611/mm/vmscan.c
@@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
+	if (!current_is_kswapd()) {
+		/* pass this page to kswapd. */
+		SetPageReclaim(page);
+		return PAGE_KEEP;
+	}
+
 	if (clear_page_dirty_for_io(page)) {
 		int res;
 		struct writeback_control wbc = {
@@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
 
 redo:
 	ClearPageUnevictable(page);
+	/* This function never puts pages to CLEANING queue */
+	ClearPageReclaim(page);
 
 	if (page_evictable(page, NULL)) {
 		/*
@@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
 		 * page release code relies on it.
 		 */
 		ClearPageLRU(page);
+		/* when someone isolate this page, clear reclaim status */
+		ClearPageReclaim(page);
 		ret = 0;
 	}
 
@@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
  * # of pages of each types and clearing any active bits.
  */
 static unsigned long count_page_types(struct list_head *page_list,
-				unsigned int *count, int clear_active)
+				unsigned int *count, int clear_actives)
 {
 	int nr_active = 0;
 	int lru;
@@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
 			int lru = page_lru(page);
 			ret = 0;
 			ClearPageLRU(page);
+			ClearPageReclaim(page);
 
 			del_page_from_lru_list(zone, page, lru);
 		}
@@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
 	return isolated > inactive;
 }
 
+/* only called by kswapd to do I/O and put back clean pages to its LRU */
+static void shrink_cleaning_list(struct zone *zone)
+{
+	LIST_HEAD(page_list);
+	struct list_head *src;
+	struct pagevec pvec;
+	unsigned long nr_pageout;
+	unsigned long nr_cleaned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = 1,
+		.mem_cgroup = NULL,
+	};
+
+	pagevec_init(&pvec, 1);
+	lru_add_drain();
+
+	src = &zone->lru[LRU_CLEANING].list;
+	nr_pageout = 0;
+	nr_cleaned = 0;
+	spin_lock_irq(&zone->lru_lock);
+	do {
+		unsigned int count[NR_LRU_LISTS] = {0,};
+		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
+		unsigned long nr_scan;
+
+		if (list_empty(src))
+			goto done;
+
+		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
+				zone_page_state(zone, NR_CLEANING)/8);
+		/* we do global-only */
+		nr_taken = isolate_lru_pages(check_clean,
+					src, &page_list, &nr_scan,
+					0, ISOLATE_BOTH, 0);
+		zone->pages_scanned += nr_scan;
+		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
+		if (nr_taken == 0)
+			goto done;
+		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
+		spin_unlock_irq(&zone->lru_lock);
+		/*
+		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
+		 * we can count correct value
+		 */
+		count_page_types(&page_list, count, 0);
+		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+
+		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
+		/*
+		 * Put back any unfreeable pages.
+		 */
+		while (!list_empty(&page_list)) {
+			int lru;
+			struct page *page;
+
+			page = lru_to_page(&page_list);
+			VM_BUG_ON(PageLRU(page));
+			list_del(&page->lru);
+			if (unlikely(!page_evictable(page, NULL))) {
+				spin_unlock_irq(&zone->lru_lock);
+				putback_lru_page(page);
+				spin_lock_irq(&zone->lru_lock);
+				continue;
+			}
+			SetPageLRU(page);
+			lru = page_lru(page);
+			add_page_to_lru_list(zone, page, lru);
+			if (!pagevec_add(&pvec, page)) {
+				spin_unlock_irq(&zone->lru_lock);
+				__pagevec_release(&pvec);
+				spin_lock_irq(&zone->lru_lock);
+			}
+		}
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_pageout += nr_taken - nr_freed;
+		nr_cleaned += nr_freed;
+		if (nr_pageout > SWAP_CLUSTER_MAX) {
+			/* there are remaining I/Os */
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			nr_pageout /= 2;
+		}
+	} while(nr_cleaned < SWAP_CLUSTER_MAX);
+done:
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+	return;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
 					sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (current_is_kswapd())
+			shrink_cleaning_list(zone);
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2222,6 +2333,42 @@ out:
 	return sc.nr_reclaimed;
 }
 
+static void launder_pgdat(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (!zone_page_state(zone, NR_CLEANING))
+			continue;
+		shrink_cleaning_list(zone);
+	}
+}
+
+/*
+ * Find a zone which has cleaning list.
+ */
+static int need_to_cleaning_node(pg_data_t *pgdat)
+{
+	int i;
+	struct zone *zone;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+	}
+	return (i != MAX_NR_ZONES);
+}
+
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2275,7 +2422,9 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		if (need_to_cleaning_node(pgdat)) {
+			launder_pgdat(pgdat);
+		} else if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
 			 * allocation

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-30 13:05     ` Chris Mason
  -1 siblings, 0 replies; 105+ messages in thread
From: Chris Mason @ 2010-06-30 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 12:34:47PM +0100, Mel Gorman wrote:
> As only kswapd and memcg are writing back pages, there should be no
> danger of overflowing the stack. Allow the writing back of dirty pages
> in btrfs from the VM.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Signed-off-by: Chris Mason <chris.mason@oracle.com>

But, this is only the metadata writepage.  fs/btrfs/inode.c has another
one for data pages.  (just look for PF_MEMALLOC).
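
(For illustration only: the sort of PF_MEMALLOC bail-out being referred to
looks roughly like the sketch below. The function names are invented and
this is not the actual btrfs code.)

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/* Sketch: refuse to do I/O when ->writepage is called from reclaim */
static int sketch_data_writepage(struct page *page,
				 struct writeback_control *wbc)
{
	if (current->flags & PF_MEMALLOC) {
		/* called from (direct) reclaim: redirty and defer to flushers */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* normal writeback path would continue here */
	return 0;
}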

-chris

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 14:27     ` Minchan Kim
@ 2010-07-01  9:53       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01  9:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 11:27:05PM +0900, Minchan Kim wrote:
> On Tue, Jun 29, 2010 at 8:34 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Use lock_page_nosync in handle_write_error as after writepage we have no
> > reference to the mapping when taking the page lock.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Trivial.
> Please modify description of the function if you have a next turn.
> "run sleeping lock_page()" -> "run sleeping lock_page_nosync"
> 

Fixed, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-30 13:05     ` Chris Mason
  (?)
@ 2010-07-01  9:55       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01  9:55 UTC (permalink / raw)
  To: Chris Mason, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jun 30, 2010 at 09:05:04AM -0400, Chris Mason wrote:
> On Tue, Jun 29, 2010 at 12:34:47PM +0100, Mel Gorman wrote:
> > As only kswapd and memcg are writing back pages, there should be no
> > danger of overflowing the stack. Allow the writing back of dirty pages
> > in btrfs from the VM.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Signed-off-by: Chris Mason <chris.mason@oracle.com>
> 
> But, this is only the metadata writepage.  fs/btrfs/inode.c has another
> one for data pages.  (just look for PF_MEMALLOC).
> 

My bad, fixed now. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-30  0:14         ` KAMEZAWA Hiroyuki
@ 2010-07-01 10:30           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01 10:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jun 30, 2010 at 09:14:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Jun 2010 13:51:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > > I don't see a patch in this set which refuses writeback from the memcg
> > > context, which we identified as having large stack footprint in hte
> > > discussion of the last patch set.
> > > 
> > 
> > It wasn't clear to me what the right approach was there and should
> > have noted that in the intro. The last note I have on it is this message
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> > avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> > pages for memcg, I left memcg being able to direct writeback to see what
> > the memcg people preferred.
> > 
> 
> Hmm. If some filesystems don't support direct ->writeback,

This is not strictly true. All of them support ->writeback, but some of
them are ignoring it from reclaim context - right now xfs, btrfs and extN -
so it's a sizable cross-section of the filesystems we care about.

> memcg shouldn't
> depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> ok.
> 
> Then, my concern here is that which kswapd we should wake up and how it can stop.

And also what the consequences are of kswapd being occupied with containers
instead of the global lists for a time.

> IOW, how kswapd can know a memcg has some remaining writeback and struck on it.
> 

Another possibility for memcg would be to revisit Andrea's suggestion on
switching stacks in more detail. I still haven't gotten around to this as
PhD work is sucking up piles of my time.

> One idea is here. (this patch will not work...not tested at all.)
> If we can have "victim page list" and kswapd can depend on it to know
> "which pages should be written", kswapd can know when it should work.
> 
> cpu usage by memcg will be a new problem...but...
> 
> ==
> Add a new LRU "CLEANING" and make kswapd launder it.
> This patch also changes PG_reclaim behavior. New PG_reclaim works
> as
>  - If PG_reclaim is set, a page is on CLEAINING LIST.
> 
> And when kswapd launder a page
>  - issue an writeback. (I'm thinking whehter I should put this
>    cleaned page back to CLEANING lru and free it later.) 
>  - if it can free directly, free it.
> This just use current shrink_list().
> 
> Maybe this patch itself inlcludes many bad point...
> 
> ---
>  fs/proc/meminfo.c         |    2 
>  include/linux/mm_inline.h |    9 ++
>  include/linux/mmzone.h    |    7 ++
>  mm/filemap.c              |    3 
>  mm/internal.h             |    1 
>  mm/page-writeback.c       |    1 
>  mm/page_io.c              |    1 
>  mm/swap.c                 |   31 ++-------
>  mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
>  9 files changed, 176 insertions(+), 32 deletions(-)
> 
> Index: mmotm-0611/include/linux/mmzone.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mmzone.h
> +++ mmotm-0611/include/linux/mmzone.h
> @@ -85,6 +85,7 @@ enum zone_stat_item {
>  	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
>  	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
> +	NR_CLEANING,		/*  "     "     "   "       "         */
>  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
>  	NR_ANON_PAGES,	/* Mapped anonymous pages */
>  	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
> @@ -133,6 +134,7 @@ enum lru_list {
>  	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
>  	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
>  	LRU_UNEVICTABLE,
> +	LRU_CLEANING,
>  	NR_LRU_LISTS
>  };
>  
> @@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
>  	return (l == LRU_UNEVICTABLE);
>  }
>  
> +static inline int is_cleaning_lru(enum lru_list l)
> +{
> +	return (l == LRU_CLEANING);
> +}
> +

Nit - LRU_CLEAN_PENDING might be clearer, as CLEANING implies it is currently
being cleaned (implying it's the same as NR_WRITEBACK) or is definitely dirty,
implying it's the same as NR_DIRTY.

>  enum zone_watermarks {
>  	WMARK_MIN,
>  	WMARK_LOW,
> Index: mmotm-0611/include/linux/mm_inline.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mm_inline.h
> +++ mmotm-0611/include/linux/mm_inline.h
> @@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
>  	enum lru_list l;
>  
>  	list_del(&page->lru);
> -	if (PageUnevictable(page)) {
> +	if (PageReclaim(page)) {
> +		ClearPageReclaim(page);
> +		l = LRU_CLEANING;
> +	} else if (PageUnevictable(page)) {
>  		__ClearPageUnevictable(page);
>  		l = LRU_UNEVICTABLE;
>  	} else {

One point of note is that having an LRU cleaning list will alter the aging
of pages quite a bit.

A slightly greater concern is that clean pages can be temporarily "lost"
on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
list, it is no longer considering those pages even if a flusher thread
happened to clean them before kswapd had a chance. Let's say that under
heavy memory pressure a lot of pages are being dirtied and encountered on
the LRU list. They move to LRU_CLEANING, where dirty balancing starts making
sure they get cleaned, but they are no longer being reclaimed.

Of course, I might be wrong but it's not a trivial direction to take.

> @@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
>  {
>  	enum lru_list lru;
>  
> -	if (PageUnevictable(page))
> +	if (PageReclaim(page)) {
> +		lru = LRU_CLEANING;
> +	} else if (PageUnevictable(page))
>  		lru = LRU_UNEVICTABLE;
>  	else {
>  		lru = page_lru_base_type(page);
> Index: mmotm-0611/fs/proc/meminfo.c
> ===================================================================
> --- mmotm-0611.orig/fs/proc/meminfo.c
> +++ mmotm-0611/fs/proc/meminfo.c
> @@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
>  		"Active(file):   %8lu kB\n"
>  		"Inactive(file): %8lu kB\n"
>  		"Unevictable:    %8lu kB\n"
> +		"Cleaning:       %8lu kB\n"
>  		"Mlocked:        %8lu kB\n"
>  #ifdef CONFIG_HIGHMEM
>  		"HighTotal:      %8lu kB\n"
> @@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
>  		K(pages[LRU_ACTIVE_FILE]),
>  		K(pages[LRU_INACTIVE_FILE]),
>  		K(pages[LRU_UNEVICTABLE]),
> +		K(pages[LRU_CLEANING]),
>  		K(global_page_state(NR_MLOCK)),
>  #ifdef CONFIG_HIGHMEM
>  		K(i.totalhigh),
> Index: mmotm-0611/mm/swap.c
> ===================================================================
> --- mmotm-0611.orig/mm/swap.c
> +++ mmotm-0611/mm/swap.c
> @@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
>  			zone = pagezone;
>  			spin_lock(&zone->lru_lock);
>  		}
> -		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -			int lru = page_lru_base_type(page);
> +		if (PageLRU(page)) {
> +			int lru = page_lru(page);
>  			list_move_tail(&page->lru, &zone->lru[lru].list);
>  			pgmoved++;
>  		}
> @@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
>  	pagevec_reinit(pvec);
>  }
>  
> -/*
> - * Writeback is about to end against a page which has been marked for immediate
> - * reclaim.  If it still appears to be reclaimable, move it to the tail of the
> - * inactive list.
> - */
> -void  rotate_reclaimable_page(struct page *page)
> -{
> -	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> -	    !PageUnevictable(page) && PageLRU(page)) {
> -		struct pagevec *pvec;
> -		unsigned long flags;
> -
> -		page_cache_get(page);
> -		local_irq_save(flags);
> -		pvec = &__get_cpu_var(lru_rotate_pvecs);
> -		if (!pagevec_add(pvec, page))
> -			pagevec_move_tail(pvec);
> -		local_irq_restore(flags);
> -	}
> -}
> -
>  static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>  				     int file, int rotated)
>  {
> @@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
>  {
>  	if (PageActive(page)) {
>  		VM_BUG_ON(PageUnevictable(page));
> +		VM_BUG_ON(PageReclaim(page));
>  		ClearPageActive(page);
>  	} else if (PageUnevictable(page)) {
>  		VM_BUG_ON(PageActive(page));
> +		VM_BUG_ON(PageReclaim(page));
>  		ClearPageUnevictable(page);
> +	} else if (PageReclaim(page)) {
> +		VM_BUG_ON(PageReclaim(page));
> +		VM_BUG_ON(PageUnevictable(page));
> +		ClearPageReclaim(page);
>  	}
>  
>  	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
> Index: mmotm-0611/mm/filemap.c
> ===================================================================
> --- mmotm-0611.orig/mm/filemap.c
> +++ mmotm-0611/mm/filemap.c
> @@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
>   */
>  void end_page_writeback(struct page *page)
>  {
> -	if (TestClearPageReclaim(page))
> -		rotate_reclaimable_page(page);
> -
>  	if (!test_clear_page_writeback(page))
>  		BUG();
>  
> Index: mmotm-0611/mm/internal.h
> ===================================================================
> --- mmotm-0611.orig/mm/internal.h
> +++ mmotm-0611/mm/internal.h
> @@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
>  extern u64 hwpoison_filter_flags_value;
>  extern u64 hwpoison_filter_memcg;
>  extern u32 hwpoison_filter_enable;
> +
> Index: mmotm-0611/mm/page-writeback.c
> ===================================================================
> --- mmotm-0611.orig/mm/page-writeback.c
> +++ mmotm-0611/mm/page-writeback.c
> @@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
>  
>  	BUG_ON(!PageLocked(page));
>  
> -	ClearPageReclaim(page);
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		/*
>  		 * Yes, Virginia, this is indeed insane.
> Index: mmotm-0611/mm/page_io.c
> ===================================================================
> --- mmotm-0611.orig/mm/page_io.c
> +++ mmotm-0611/mm/page_io.c
> @@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
>  				imajor(bio->bi_bdev->bd_inode),
>  				iminor(bio->bi_bdev->bd_inode),
>  				(unsigned long long)bio->bi_sector);
> -		ClearPageReclaim(page);
>  	}
>  	end_page_writeback(page);
>  	bio_put(bio);
> Index: mmotm-0611/mm/vmscan.c
> ===================================================================
> --- mmotm-0611.orig/mm/vmscan.c
> +++ mmotm-0611/mm/vmscan.c
> @@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
>  	if (!may_write_to_queue(mapping->backing_dev_info))
>  		return PAGE_KEEP;
>  
> +	if (!current_is_kswapd()) {
> +		/* pass this page to kswapd. */
> +		SetPageReclaim(page);
> +		return PAGE_KEEP;
> +	}
> +
>  	if (clear_page_dirty_for_io(page)) {
>  		int res;
>  		struct writeback_control wbc = {
> @@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
>  
>  redo:
>  	ClearPageUnevictable(page);
> +	/* This function never puts pages to CLEANING queue */
> +	ClearPageReclaim(page);
>  
>  	if (page_evictable(page, NULL)) {
>  		/*
> @@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
>  		 * page release code relies on it.
>  		 */
>  		ClearPageLRU(page);
> +		/* when someone isolate this page, clear reclaim status */
> +		ClearPageReclaim(page);
>  		ret = 0;
>  	}
>  
> @@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
>   * # of pages of each types and clearing any active bits.
>   */
>  static unsigned long count_page_types(struct list_head *page_list,
> -				unsigned int *count, int clear_active)
> +				unsigned int *count, int clear_actives)
>  {
>  	int nr_active = 0;
>  	int lru;
> @@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
>  			int lru = page_lru(page);
>  			ret = 0;
>  			ClearPageLRU(page);
> +			ClearPageReclaim(page);
>  
>  			del_page_from_lru_list(zone, page, lru);
>  		}
> @@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
>  	return isolated > inactive;
>  }
>  
> +/* only called by kswapd to do I/O and put back clean pages to its LRU */
> +static void shrink_cleaning_list(struct zone *zone)
> +{
> +	LIST_HEAD(page_list);
> +	struct list_head *src;
> +	struct pagevec pvec;
> +	unsigned long nr_pageout;
> +	unsigned long nr_cleaned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = 1,
> +		.mem_cgroup = NULL,
> +	};
> +
> +	pagevec_init(&pvec, 1);
> +	lru_add_drain();
> +
> +	src = &zone->lru[LRU_CLEANING].list;
> +	nr_pageout = 0;
> +	nr_cleaned = 0;
> +	spin_lock_irq(&zone->lru_lock);
> +	do {
> +		unsigned int count[NR_LRU_LISTS] = {0,};
> +		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
> +		unsigned long nr_scan;
> +
> +		if (list_empty(src))
> +			goto done;
> +
> +		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
> +				zone_page_state(zone, NR_CLEANING)/8);
> +		/* we do global-only */
> +		nr_taken = isolate_lru_pages(check_clean,
> +					src, &page_list, &nr_scan,
> +					0, ISOLATE_BOTH, 0);
> +		zone->pages_scanned += nr_scan;
> +		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
> +		if (nr_taken == 0)
> +			goto done;
> +		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
> +		spin_unlock_irq(&zone->lru_lock);
> +		/*
> +		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
> +		 * we can count correct value
> +		 */
> +		count_page_types(&page_list, count, 0);
> +		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> +
> +		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);

So, at this point the isolated pages are cleaned and put back, which is
fine. If they were already clean, they get freed, which is also fine. But
direct reclaimers do not call this function, so they could be missing
clean and freeable pages, which worries me.

> +		/*
> +		 * Put back any unfreeable pages.
> +		 */
> +		while (!list_empty(&page_list)) {
> +			int lru;
> +			struct page *page;
> +
> +			page = lru_to_page(&page_list);
> +			VM_BUG_ON(PageLRU(page));
> +			list_del(&page->lru);
> +			if (unlikely(!page_evictable(page, NULL))) {
> +				spin_unlock_irq(&zone->lru_lock);
> +				putback_lru_page(page);
> +				spin_lock_irq(&zone->lru_lock);
> +				continue;
> +			}
> +			SetPageLRU(page);
> +			lru = page_lru(page);
> +			add_page_to_lru_list(zone, page, lru);
> +			if (!pagevec_add(&pvec, page)) {
> +				spin_unlock_irq(&zone->lru_lock);
> +				__pagevec_release(&pvec);
> +				spin_lock_irq(&zone->lru_lock);
> +			}
> +		}
> +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> +		nr_pageout += nr_taken - nr_freed;
> +		nr_cleaned += nr_freed;
> +		if (nr_pageout > SWAP_CLUSTER_MAX) {
> +			/* there are remaining I/Os */
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			nr_pageout /= 2;
> +		}
> +	} while(nr_cleaned < SWAP_CLUSTER_MAX);
> +done:
> +	spin_unlock_irq(&zone->lru_lock);
> +	pagevec_release(&pvec);
> +	return;
> +}
> +
>  /*
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
> @@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
>  					sc->nodemask) {
>  		if (!populated_zone(zone))
>  			continue;
> +
> +		if (current_is_kswapd())
> +			shrink_cleaning_list(zone);
>  		/*
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
> @@ -2222,6 +2333,42 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  
> +static void launder_pgdat(pg_data_t *pgdat)
> +{
> +	struct zone *zone;
> +	int i;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> +		zone = &pgdat->node_zones[i];
> +		if (!populated_zone(zone))
> +			continue;
> +		if (!zone_page_state(zone, NR_CLEANING))
> +			continue;
> +		shrink_cleaning_list(zone);
> +	}
> +}
> +
> +/*
> + * Find a zone which has cleaning list.
> + */
> +static int need_to_cleaning_node(pg_data_t *pgdat)
> +{
> +	int i;
> +	struct zone *zone;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> +		zone = &pgdat->node_zones[i];
> +		if (!populated_zone(zone))
> +			continue;
> +		if (zone_page_state(zone, NR_CLEANING))
> +			break;
> +	}
> +	return (i != MAX_NR_ZONES);
> +}
> +
> +
>  /*
>   * The background pageout daemon, started as a kernel thread
>   * from the init process.
> @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
>  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  		new_order = pgdat->kswapd_max_order;
>  		pgdat->kswapd_max_order = 0;
> -		if (order < new_order) {
> +		if (need_to_cleaning_node(pgdat)) {
> +			launder_pgdat(pgdat);
> +		} else if (order < new_order) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'
>  			 * allocation

I see the direction you are thinking of, but I have big concerns about clean
pages getting delayed for too long on the LRU_CLEANING list before kswapd
puts them back in the right place. I think a safer direction would be for
the memcg people to investigate Andrea's "switch stack" suggestion.

In the meantime, for my own series, memcg now treats dirty pages similarly to
lumpy reclaim. It asks flusher threads to clean pages but stalls for a time
waiting for those pages to be cleaned. This is an untested patch on top
of the current series.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5c4f08b..81c6fbe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -367,10 +367,10 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
 	return PAGE_SUCCESS;
 }
 
-/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+/* For now, only kswapd can writeback as it will not overflow stack */
 static inline bool reclaim_can_writeback(struct scan_control *sc)
 {
-	return current_is_kswapd() || sc->mem_cgroup != NULL;
+	return current_is_kswapd();
 }
 
 /*
@@ -900,10 +900,11 @@ keep_dirty:
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * As lumpy reclaim targets specific pages, wait on them
-		 * to be cleaned and try reclaim again for a time.
+		 * As lumpy reclaim and memcg targets specific pages, wait on
+		 * them to be cleaned and try reclaim again.
 		 */
-		if (sync_writeback == PAGEOUT_IO_SYNC) {
+		if (sync_writeback == PAGEOUT_IO_SYNC ||
+							sc->mem_cgroup != NULL) {
 			dirty_isolated++;
 			list_splice(&dirty_pages, page_list);
 			INIT_LIST_HEAD(&dirty_pages);

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-01 10:30           ` Mel Gorman
@ 2010-07-02  6:26             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-02  6:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Thu, 1 Jul 2010 11:30:32 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> > memcg shouldn't
> > depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> > ok.
> > 
> > Then, my concern here is that which kswapd we should wake up and how it can stop.
> 
> And also what the consequences are of kswapd being occupied with containers
> instead of the global lists for a time.
> 
Yes, we may have to add a thread or a workqueue for memcg to isolate that workload.
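
(A very rough sketch of that direction, purely for illustration; the work
item, its callback, and what it would actually do are all invented here.)

#include <linux/workqueue.h>

/* Hypothetical per-memcg laundering work: write back / reclaim pages
 * charged to a memcg from process context instead of direct reclaim. */
static void memcg_launder_workfn(struct work_struct *work)
{
	/* ... launder the memcg's dirty pages here ... */
}

static DECLARE_WORK(memcg_launder_work, memcg_launder_workfn);

/* Called when a memcg hits its limit and finds dirty pages */
static void memcg_kick_launder(void)
{
	schedule_work(&memcg_launder_work);
}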


> > IOW, how kswapd can know a memcg has some remaining writeback and struck on it.
> > 
> 
> Another possibility for memcg would be to visit Andrea's suggestion on
> switching stack in more detail. I still haven't gotten around to this as
> phd stuff is sucking up piles of my time.

Sure.

> > One idea is here. (this patch will not work...not tested at all.)
> > If we can have "victim page list" and kswapd can depend on it to know
> > "which pages should be written", kswapd can know when it should work.
> > 
> > cpu usage by memcg will be a new problem...but...
> > 
> > ==
> > Add a new LRU "CLEANING" and make kswapd launder it.
> > This patch also changes PG_reclaim behavior. New PG_reclaim works
> > as
> >  - If PG_reclaim is set, a page is on CLEAINING LIST.
> > 
> > And when kswapd launder a page
> >  - issue an writeback. (I'm thinking whehter I should put this
> >    cleaned page back to CLEANING lru and free it later.) 
> >  - if it can free directly, free it.
> > This just use current shrink_list().
> > 
> > Maybe this patch itself inlcludes many bad point...
> > 
> > ---
> >  fs/proc/meminfo.c         |    2 
> >  include/linux/mm_inline.h |    9 ++
> >  include/linux/mmzone.h    |    7 ++
> >  mm/filemap.c              |    3 
> >  mm/internal.h             |    1 
> >  mm/page-writeback.c       |    1 
> >  mm/page_io.c              |    1 
> >  mm/swap.c                 |   31 ++-------
> >  mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
> >  9 files changed, 176 insertions(+), 32 deletions(-)
> > 
> > Index: mmotm-0611/include/linux/mmzone.h
> > ===================================================================
> > --- mmotm-0611.orig/include/linux/mmzone.h
> > +++ mmotm-0611/include/linux/mmzone.h
> > @@ -85,6 +85,7 @@ enum zone_stat_item {
> >  	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
> >  	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
> >  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
> > +	NR_CLEANING,		/*  "     "     "   "       "         */
> >  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
> >  	NR_ANON_PAGES,	/* Mapped anonymous pages */
> >  	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
> > @@ -133,6 +134,7 @@ enum lru_list {
> >  	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> >  	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> >  	LRU_UNEVICTABLE,
> > +	LRU_CLEANING,

 
> > +static inline int is_cleaning_lru(enum lru_list l)
> > +{
> > +	return (l == LRU_CLEANING);
> > +}
> > +
> 
> Nit - LRU_CLEAN_PENDING might be clearer as CLEANING implies it is currently
> being cleaned (implying it's the same as NR_WRITEBACK) or is definely dirty
> implying it's the same as NR_DIRTY.
> 
ok.

> >  enum zone_watermarks {
> >  	WMARK_MIN,
> >  	WMARK_LOW,
> > Index: mmotm-0611/include/linux/mm_inline.h
> > ===================================================================
> > --- mmotm-0611.orig/include/linux/mm_inline.h
> > +++ mmotm-0611/include/linux/mm_inline.h
> > @@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
> >  	enum lru_list l;
> >  
> >  	list_del(&page->lru);
> > -	if (PageUnevictable(page)) {
> > +	if (PageReclaim(page)) {
> > +		ClearPageReclaim(page);
> > +		l = LRU_CLEANING;
> > +	} else if (PageUnevictable(page)) {
> >  		__ClearPageUnevictable(page);
> >  		l = LRU_UNEVICTABLE;
> >  	} else {
> 
> One point of note is that having a LRU cleaning list will alter the aging
> of pages quite a bit.
> 
yes.

> A slightly greater concern is that clean pages can be temporarily "lost"
> on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> list, it's no longer considering those pages even if a flusher thread
> happened to clean those pages before kswapd had a chance. Lets say under
> heavy memory pressure a lot of pages are being dirties and encountered on
> the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> sure they get cleaned but are no longer being reclaimed.
> 
> Of course, I might be wrong but it's not a trivial direction to take.
> 

I hope dirty_ratio et al. may help us. But I agree this "hiding" can cause
issues.
IIRC, someone wrote a patch to prevent too many threads from entering vmscan..
that kind of work may be necessary.




> > +/* only called by kswapd to do I/O and put back clean paes to its LRU */
> > +static void shrink_cleaning_list(struct zone *zone)
> > +{

> > +		count_page_types(&page_list, count, 0);
> > +		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > +		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> > +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> > +
> > +		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
> 
> So, at this point the isolated pages are cleaned and put back which is
> fine. If they were already clean, they get freed which is also fine. But
> direct reclaimers do not call this function so they could be missing
> clean and freeable pages which worries me.
> 

Hmm. I have to be afraid of that...my first thought was adding klaunderd
and a waitqueue between klaunderd and direct reclaimers.
I used kswapd to keep the whole thing simple, but I wonder whether we need some
waitq for the case where we're afraid that all pages are under I/O.
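
A minimal sketch of that klaunderd/waitqueue hand-off, purely for illustration
(all names are hypothetical and the actual laundering of the queued pages is
elided):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/kthread.h>

static LIST_HEAD(launder_list);
static DEFINE_SPINLOCK(launder_lock);
static DECLARE_WAIT_QUEUE_HEAD(klaunderd_wait);	/* klaunderd sleeps here */
static DECLARE_WAIT_QUEUE_HEAD(launder_done);	/* direct reclaimers sleep here */

/* Direct reclaimer: hand over dirty pages and wait for a laundering pass. */
static void launder_request_and_wait(struct list_head *dirty_pages)
{
	spin_lock(&launder_lock);
	list_splice_tail_init(dirty_pages, &launder_list);
	spin_unlock(&launder_lock);

	wake_up(&klaunderd_wait);

	/* Bounded wait so a stuck device cannot hang direct reclaim forever. */
	wait_event_timeout(launder_done, list_empty(&launder_list), HZ / 10);
}

/* klaunderd main loop; issuing the actual writeback is omitted. */
static int klaunderd(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(klaunderd_wait,
					 !list_empty(&launder_list) ||
					 kthread_should_stop());

		/* ... pull pages off launder_list and start writeback ... */

		wake_up_all(&launder_done);
	}
	return 0;
}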


> > +		/*
> > +		 * Put back any unfreeable pages.
> > +		 */

> >  /*
> >   * The background pageout daemon, started as a kernel thread
> >   * from the init process.
> > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  		new_order = pgdat->kswapd_max_order;
> >  		pgdat->kswapd_max_order = 0;
> > -		if (order < new_order) {
> > +		if (need_to_cleaning_node(pgdat)) {
> > +			launder_pgdat(pgdat);
> > +		} else if (order < new_order) {
> >  			/*
> >  			 * Don't sleep if someone wants a larger 'order'
> >  			 * allocation
> 
> I see the direction you are thinking of but I have big concerns about clean
> pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> puts them back in the right place. I think a safer direction would be for
> memcg people to investigate Andrea's "switch stack" suggestion.
> 
Hmm, I may have to consider that. My concern is that the IRQ switch-stack works
well only because there is no task switch in the IRQ routine. (I'm sorry if I
misunderstand.)

One possibility for memcg would be to limit the number of reclaimers that can
use __GFP_FS and to use a shared stack per cpu per memcg.
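
A crude sketch of the limiting part of that idea (the shared-stack side is not
shown; the gate structure and the limit are hypothetical, and struct mem_cgroup
is treated as opaque):

#include <linux/types.h>
#include <linux/semaphore.h>
#include <linux/gfp.h>

#define MEMCG_MAX_FS_RECLAIMERS	4	/* arbitrary illustrative limit */

struct memcg_reclaim_gate {
	/* initialised with sema_init(&gate->fs_slots, MEMCG_MAX_FS_RECLAIMERS) */
	struct semaphore fs_slots;
};

/* May this task do __GFP_FS writeback for the memcg right now? */
static bool memcg_may_writeback(struct memcg_reclaim_gate *gate, gfp_t gfp_mask)
{
	if (!(gfp_mask & __GFP_FS))
		return false;
	/* Never sleep here; just try to grab one of the limited slots. */
	return down_trylock(&gate->fs_slots) == 0;
}

/* Release the slot once writeback on behalf of the memcg is done. */
static void memcg_end_writeback(struct memcg_reclaim_gate *gate)
{
	up(&gate->fs_slots);
}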

Hmm. Yet another per-memcg memory shrinker may sound good. Two years ago, I wrote
a patch to do a high/low-watermark memory shrinker thread for memcg.
  
  - limit
  - high
  - low

Start memory reclaim/writeback when usage exceeds "high" and stop when it is below
"low". Implementing this with a thread pool could be an option.



> In the meantime for my own series, memcg now treats dirty pages similar to
> lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> for those pages to be cleaned for a time. This is an untested patch on top
> of the current series.
> 

Wow...doesn't this make memcg too slow? Anyway, memcg should kick the flusher
threads..or something; other work is needed, too.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-02  6:26             ` KAMEZAWA Hiroyuki
@ 2010-07-02  6:31               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-02  6:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Fri, 2 Jul 2010 15:26:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > I see the direction you are thinking of but I have big concerns about clean
> > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > puts them back in the right place. I think a safer direction would be for
> > memcg people to investigate Andrea's "switch stack" suggestion.
> > 
> Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> 
Ok, I'll think about this 1st.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
  2010-06-29 11:34 ` Mel Gorman
@ 2010-07-02 19:33   ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2010-07-02 19:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Tue, 29 Jun 2010 12:34:34 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> Here is V3 that depends again on flusher threads to do writeback in
> direct reclaim rather than stack switching which is not something I'm
> likely to get done before xfs/btrfs are ignoring writeback in mainline
> (phd sucking up time).

IMO, implementing stack switching for this is not a good idea.  We
_already_ have a way of doing stack-switching.  It's called
"schedule()".

The only reason I can see for implementing an in-place stack switch
would be if schedule() is too expensive.  And if we were to see
excessive context-switch overheads in this code path (and we won't)
then we should get in there and try to reduce the context switch rate
first.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-06-29 11:34   ` Mel Gorman
@ 2010-07-02 19:51     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2010-07-02 19:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Tue, 29 Jun 2010 12:34:46 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back pages by not setting
> may_writepage in scan_control. Instead, dirty pages are placed back on the
> LRU lists for either background writing by the BDI threads or kswapd. If
> in direct lumpy reclaim and dirty pages are encountered, the process will
> stall for the background flusher before trying to reclaim the pages again.
> 
> Memory control groups do not have a kswapd-like thread nor do pages get
> direct reclaimed from the page allocator. Instead, memory control group
> pages are reclaimed when the quota is being exceeded or the group is being
> shrunk. As it is not expected that the entry points into page reclaim are
> deep call chains memcg is still allowed to writeback dirty pages.

I already had "[PATCH 01/14] vmscan: Fix mapping use after free" and
I'll send that in for 2.6.35.

I grabbed [02/14] up to [11/14].  Including "[PATCH 06/14] vmscan: kill
prev_priority completely", grumpyouallsuck.

I wimped out at this, "Do not writeback pages in direct reclaim".  It
really is a profound change and needs a bit more thought, discussion
and, if possible, testing designed to explore possible pathologies.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
  2010-07-02 19:33   ` Andrew Morton
@ 2010-07-05  1:35     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-05  1:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, KOSAKI Motohiro, Andrea Arcangeli

On Fri, 2 Jul 2010 12:33:15 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 29 Jun 2010 12:34:34 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Here is V3 that depends again on flusher threads to do writeback in
> > direct reclaim rather than stack switching which is not something I'm
> > likely to get done before xfs/btrfs are ignoring writeback in mainline
> > (phd sucking up time).
> 
> IMO, implemetning stack switching for this is not a good idea.  We
> _already_ have a way of doing stack-switching.  It's called
> "schedule()".
> 
Sure. 

> The only reason I can see for implementing an in-place stack switch
> would be if schedule() is too expensive.  And if we were to see
> excessive context-switch overheads in this code path (and we won't)
> then we should get in there and try to reduce the contect switch rate
> first.
> 

Maybe the concern of in-place stack-switch proponents is that it's difficult
to guarantee when the pageout() will be issued.

I'd like to try adding calls such as:

 - pageout_request(page) .... request that a daemon (kswapd) page out a page.
 - pageout_barrier(zone? node?) .... wait until all writebacks end.

Implementation dilemma:

page->lru is a very useful link for implementing calls like the above, but
there is a concern that using it will hide pages from vmscan unnecessarily.
Avoiding the use of page->lru means using another structure like a pagevec,
but that means page_count()+1 and pins pages unnecessarily. I'm now considering
how to implement a safe and scalable way to pageout-in-another-stack(thread)...
I suspect it will require some throttling method for pageout, anyway.
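
A rough sketch of those two calls using the page->lru variant (hypothetical code;
it assumes the caller has already isolated the page from its LRU, and per-zone/node
bookkeeping is elided):

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

static LIST_HEAD(pageout_queue);
static DEFINE_SPINLOCK(pageout_lock);
static DECLARE_WAIT_QUEUE_HEAD(pageout_waitq);
static unsigned long pageout_pending;

/* Queue an already-isolated dirty page for the pageout daemon. */
static void pageout_request(struct page *page)
{
	spin_lock_irq(&pageout_lock);
	list_add_tail(&page->lru, &pageout_queue);
	pageout_pending++;
	spin_unlock_irq(&pageout_lock);
	/* ... wake the daemon (kswapd or a dedicated thread) here ... */
}

/* Called by the daemon once writeback of a queued page has been issued. */
static void pageout_issued(struct page *page)
{
	spin_lock_irq(&pageout_lock);
	if (!--pageout_pending)
		wake_up_all(&pageout_waitq);
	spin_unlock_irq(&pageout_lock);
}

/* Wait until everything queued so far has been submitted for writeback. */
static void pageout_barrier(void)
{
	wait_event(pageout_waitq, pageout_pending == 0);
}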

And my own problem is that I would need to add per-memcg threads or use some
thread pool ;( But that is another topic.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-02 19:51     ` Andrew Morton
@ 2010-07-05 13:49       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-05 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Fri, Jul 02, 2010 at 12:51:55PM -0700, Andrew Morton wrote:
> On Tue, 29 Jun 2010 12:34:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back pages by not setting
> > may_writepage in scan_control. Instead, dirty pages are placed back on the
> > LRU lists for either background writing by the BDI threads or kswapd. If
> > in direct lumpy reclaim and dirty pages are encountered, the process will
> > stall for the background flusher before trying to reclaim the pages again.
> > 
> > Memory control groups do not have a kswapd-like thread nor do pages get
> > direct reclaimed from the page allocator. Instead, memory control group
> > pages are reclaimed when the quota is being exceeded or the group is being
> > shrunk. As it is not expected that the entry points into page reclaim are
> > deep call chains memcg is still allowed to writeback dirty pages.
> 
> I already had "[PATCH 01/14] vmscan: Fix mapping use after free" and
> I'll send that in for 2.6.35.
> 

Perfect, thanks.

> I grabbed [02/14] up to [11/14].  Including "[PATCH 06/14] vmscan: kill
> prev_priority completely", grumpyouallsuck.
> 
> I wimped out at this, "Do not writeback pages in direct reclaim".  It
> really is a profound change and needs a bit more thought, discussion
> and if possible testing which is designed to explore possible pathologies.
> 

Ok, that's reasonable as I'm still working on that patch. For example, the
patch disables anonymous page writeback, which is unnecessary as the stack
usage for anon writeback is less than for file writeback. Second, using systemtap,
I was able to see that file-backed dirty pages have a tendency to be near the
end of the LRU even though they are a small percentage of the overall pages
in the LRU. I'm hoping to figure out why that is, as it would make avoiding
writeback a lot less controversial.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-02  6:26             ` KAMEZAWA Hiroyuki
@ 2010-07-05 14:16               ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-05 14:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Fri, Jul 02, 2010 at 03:26:43PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Jul 2010 11:30:32 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> > > memcg shouldn't
> > > depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> > > ok.
> > > 
> > > Then, my concern here is that which kswapd we should wake up and how it can stop.
> > 
> > And also what the consequences are of kswapd being occupied with containers
> > instead of the global lists for a time.
> > 
>
> yes, we may have to add a thread or workqueue for memcg for isolating workloads.
> 

Possibly, and the closer it is to kswapd behaviour the better, I would
imagine, but I must warn that I do not have much familiarity with the
behaviour of large numbers of memcgs entering reclaim.

> > A slightly greater concern is that clean pages can be temporarily "lost"
> > on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> > list, it's no longer considering those pages even if a flusher thread
> > happened to clean those pages before kswapd had a chance. Lets say under
> > heavy memory pressure a lot of pages are being dirties and encountered on
> > the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> > sure they get cleaned but are no longer being reclaimed.
> > 
> > Of course, I might be wrong but it's not a trivial direction to take.
> > 
> 
> I hope dirty_ratio at el may help us. But I agree this "hiding" can cause
> issue.
> IIRC, someone wrote a patch to prevent too many threads enter vmscan..
> such kinds of work may be necessary.
> 

Using systemtap, I have found, in global reclaim at least, that the ratio of
dirty to clean pages is not a problem. What does appear to be a problem is
that dirty pages are getting to the end of the inactive file list while
still dirty, but I haven't formulated a theory as to why yet - maybe it's
because the dirty balancing is cleaning new pages first?  Right now, I
believe dirty_ratio is working as expected but old dirty pages are a problem.

> > > <SNIP>
> > > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> > >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > >  		new_order = pgdat->kswapd_max_order;
> > >  		pgdat->kswapd_max_order = 0;
> > > -		if (order < new_order) {
> > > +		if (need_to_cleaning_node(pgdat)) {
> > > +			launder_pgdat(pgdat);
> > > +		} else if (order < new_order) {
> > >  			/*
> > >  			 * Don't sleep if someone wants a larger 'order'
> > >  			 * allocation
> > 
> > I see the direction you are thinking of but I have big concerns about clean
> > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > puts them back in the right place. I think a safer direction would be for
> > memcg people to investigate Andrea's "switch stack" suggestion.
> > 
>
> Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> 
> One possibility for memcg will be limit the number of reclaimers who can use
> __GFP_FS and use shared stack per cpu per memcg.
> 
> Hmm. yet another per-memcg memory shrinker may sound good. 2 years ago, I wrote
> a patch to do high-low-watermark memory shirker thread for memcg.
>   
>   - limit
>   - high
>   - low
> 
> start memory reclaim/writeback when usage exceeds "high" and stop it is below
> "low". Implementing this with thread pool can be a choice.
> 

Indeed, maybe something like a kswapd-memcg thread that is shared between
a configurable number of containers?

> 
> > In the meantime for my own series, memcg now treats dirty pages similar to
> > lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> > for those pages to be cleaned for a time. This is an untested patch on top
> > of the current series.
> > 
> 
> Wow...Doesn't this make memcg too slow ?

It depends heavily on how often dirty pages are being written back by direct
reclaim. It's not ideal but stalling briefly is better than crashing.
Ideally, the number of dirty pages encountered by direct reclaim would
be so small that it wouldn't matter so I'm looking into that.

> Anyway, memcg should kick flusher
> threads..or something, needs other works, too.
> 

With this patch, the flusher threads get kicked when direct reclaim encounters
pages it cannot clean.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-05 13:49       ` Mel Gorman
@ 2010-07-06  0:36         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06  0:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

Hello,

> Ok, that's reasonable as I'm still working on that patch. For example, the
> patch disabled anonymous page writeback which is unnecessary as the stack
> usage for anon writeback is less than file writeback. 

How do we examine swap-on-file?

> Second, using systemtap,
> I was able to see that file-backed dirty pages have a tendency to be near the
> end of the LRU even though they are a small percentage of the overall pages
> in the LRU. I'm hoping to figure out why this is as it would make avoiding
> writeback a lot less controversial.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-05 14:16               ` Mel Gorman
@ 2010-07-06  0:45                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-06  0:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, 5 Jul 2010 15:16:40 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > > A slightly greater concern is that clean pages can be temporarily "lost"
> > > on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> > > list, it's no longer considering those pages even if a flusher thread
> > > happened to clean those pages before kswapd had a chance. Lets say under
> > > heavy memory pressure a lot of pages are being dirties and encountered on
> > > the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> > > sure they get cleaned but are no longer being reclaimed.
> > > 
> > > Of course, I might be wrong but it's not a trivial direction to take.
> > > 
> > 
> > I hope dirty_ratio at el may help us. But I agree this "hiding" can cause
> > issue.
> > IIRC, someone wrote a patch to prevent too many threads enter vmscan..
> > such kinds of work may be necessary.
> > 
> 
> Using systemtap, I have found in global reclaim at least that the ratio of
> dirty to clean pages is not a problem. What does appear to be a problem is
> that dirty pages are getting to the end of the inactive file list while
> still dirty but I haven't formulated a theory as to why yet - maybe it's
> because the dirty balancing is cleaning new pages first?  Right now, I
> believe dirty_ratio is working as expected but old dirty pages is a problem.
> 

Hmm. IIUC, dirty pages put back to the tail of the LRU will be moved to the head
when writeback finishes if PG_reclaim is set. This is presumably for finding clean
pages in the next vmscan.
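
For context, that rotation happens when writeback completes; roughly, this is how
end_page_writeback() looks in this era (reproduced from memory and lightly trimmed,
so treat the details as approximate; wake_up_page() is a helper local to mm/filemap.c):

void end_page_writeback(struct page *page)
{
	/*
	 * A page tagged PG_reclaim is rotated to the reclaim end of its
	 * LRU so the next scan finds it, now clean, straight away.
	 */
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);

	if (!test_clear_page_writeback(page))
		BUG();

	/* Make the flag change visible, then wake wait_on_page_writeback(). */
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}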


> > > > <SNIP>
> > > > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> > > >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > > >  		new_order = pgdat->kswapd_max_order;
> > > >  		pgdat->kswapd_max_order = 0;
> > > > -		if (order < new_order) {
> > > > +		if (need_to_cleaning_node(pgdat)) {
> > > > +			launder_pgdat(pgdat);
> > > > +		} else if (order < new_order) {
> > > >  			/*
> > > >  			 * Don't sleep if someone wants a larger 'order'
> > > >  			 * allocation
> > > 
> > > I see the direction you are thinking of but I have big concerns about clean
> > > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > > puts them back in the right place. I think a safer direction would be for
> > > memcg people to investigate Andrea's "switch stack" suggestion.
> > > 
> >
> > Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> > well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> > 
> > One possibility for memcg will be limit the number of reclaimers who can use
> > __GFP_FS and use shared stack per cpu per memcg.
> > 
> > Hmm. yet another per-memcg memory shrinker may sound good. 2 years ago, I wrote
> > a patch to do high-low-watermark memory shirker thread for memcg.
> >   
> >   - limit
> >   - high
> >   - low
> > 
> > start memory reclaim/writeback when usage exceeds "high" and stop it is below
> > "low". Implementing this with thread pool can be a choice.
> > 
> 
> Indeed, maybe something like a kswapd-memcg thread that is shared between
> a configurable number of containers?
> 
Yes, I am considering that style. I would like something with automatic
configuration, but people may want knobs.



> > 
> > > In the meantime for my own series, memcg now treats dirty pages similar to
> > > lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> > > for those pages to be cleaned for a time. This is an untested patch on top
> > > of the current series.
> > > 
> > 
> > Wow...Doesn't this make memcg too slow ?
> 
> It depends heavily on how often dirty pages are being written back by direct
> reclaim. It's not ideal but stalling briefly is better than crashing.
> Ideally, the number of dirty pages encountered by direct reclaim would
> be so small that it wouldn't matter so I'm looking into that.
> 
ok.

> > Anyway, memcg should kick flusher
> > threads..or something, needs other works, too.
> > 
> 
> With this patch, the flusher threads get kicked when direct reclaim encounters
> pages it cannot clean.
> 
Ah, I missed that. thanks.

-Kame



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  0:36         ` KOSAKI Motohiro
@ 2010-07-06  5:46           ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06  5:46 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hello,
>
>> Ok, that's reasonable as I'm still working on that patch. For example, the
>> patch disabled anonymous page writeback which is unnecessary as the stack
>> usage for anon writeback is less than file writeback.
>
> How do we examine swap-on-file?

bool is_swap_on_file(struct page *page)
{
    struct swap_info_struct *p;
    swp_entry_t entry;

    /* true if this swap page lives in a swapfile rather than
     * directly on a block device */
    entry.val = page_private(page);
    p = swap_info_get(entry);
    return !(p->flags & SWP_BLKDEV);
}

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  5:46           ` Minchan Kim
@ 2010-07-06  6:02             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06  6:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

> On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Hello,
> >
> >> Ok, that's reasonable as I'm still working on that patch. For example, the
> >> patch disabled anonymous page writeback which is unnecessary as the stack
> >> usage for anon writeback is less than file writeback.
> >
> > How do we examine swap-on-file?
> 
> bool is_swap_on_file(struct page *page)
> {
>     struct swap_info_struct *p;
>     swp_entry_entry entry;
>     entry.val = page_private(page);
>     p = swap_info_get(entry);
>     return !(p->flags & SWP_BLKDEV)
> }

Well, are you suggesting we traverse all pages in the LRU _before_
starting vmscan?




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  6:02             ` KOSAKI Motohiro
@ 2010-07-06  6:38               ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06  6:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 6, 2010 at 3:02 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Hello,
>> >
>> >> Ok, that's reasonable as I'm still working on that patch. For example, the
>> >> patch disabled anonymous page writeback which is unnecessary as the stack
>> >> usage for anon writeback is less than file writeback.
>> >
>> > How do we examine swap-on-file?
>>
>> bool is_swap_on_file(struct page *page)
>> {
>>     struct swap_info_struct *p;
>>     swp_entry_entry entry;
>>     entry.val = page_private(page);
>>     p = swap_info_get(entry);
>>     return !(p->flags & SWP_BLKDEV)
>> }
>
> Well, do you suggested we traverse all pages in lru _before_
> starting vmscan?
>

No, I am not suggesting anything; I am just saying that we could do it.
If we do have to implement it, couldn't we do it in write_reclaim_page?



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  0:36         ` KOSAKI Motohiro
@ 2010-07-06 10:12           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-06 10:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> Hello,
> 
> > Ok, that's reasonable as I'm still working on that patch. For example, the
> > patch disabled anonymous page writeback which is unnecessary as the stack
> > usage for anon writeback is less than file writeback. 
> 
> How do we examine swap-on-file?
> 

Anything in particular wrong with the following?

/*
 * For now, only kswapd can writeback filesystem pages as otherwise
 * there is a stack overflow risk
 */
static inline bool reclaim_can_writeback(struct scan_control *sc,
                                        struct page *page)
{
        return !page_is_file_cache(page) || current_is_kswapd();
}

Even if it is a swapfile, I didn't spot a case where the filesystem's
writepage would be called. Did I miss something?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 10:12           ` Mel Gorman
@ 2010-07-06 11:13             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06 11:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

> On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> > Hello,
> > 
> > > Ok, that's reasonable as I'm still working on that patch. For example, the
> > > patch disabled anonymous page writeback which is unnecessary as the stack
> > > usage for anon writeback is less than file writeback. 
> > 
> > How do we examine swap-on-file?
> > 
> 
> Anything in particular wrong with the following?
> 
> /*
>  * For now, only kswapd can writeback filesystem pages as otherwise
>  * there is a stack overflow risk
>  */
> static inline bool reclaim_can_writeback(struct scan_control *sc,
>                                         struct page *page)
> {
>         return !page_is_file_cache(page) || current_is_kswapd();
> }
> 
> Even if it is a swapfile, I didn't spot a case where the filesystems
> writepage would be called. Did I miss something?

Hmm...

Now I suspect I don't understand what you mean. Do you intend to switch the task
stack on every writepage? That seems a bit costly, but otherwise writepage for anon
pages means filesystem IO and a stack-overflow risk.

Can you please elaborate your plan?




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 10:12           ` Mel Gorman
@ 2010-07-06 11:24             ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06 11:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

Hi, Mel.

On Tue, Jul 6, 2010 at 7:12 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
>> Hello,
>>
>> > Ok, that's reasonable as I'm still working on that patch. For example, the
>> > patch disabled anonymous page writeback which is unnecessary as the stack
>> > usage for anon writeback is less than file writeback.
>>
>> How do we examine swap-on-file?
>>
>
> Anything in particular wrong with the following?
>
> /*
>  * For now, only kswapd can writeback filesystem pages as otherwise
>  * there is a stack overflow risk
>  */
> static inline bool reclaim_can_writeback(struct scan_control *sc,
>                                        struct page *page)
> {
>        return !page_is_file_cache(page) || current_is_kswapd();
> }
>
> Even if it is a swapfile, I didn't spot a case where the filesystems
> writepage would be called. Did I miss something?


As I understand Kosaki's opinion, he is saying that if we do swapout in
pageout, it isn't a problem for a swap device since swapout to a
block device is lightweight, but it is still a problem for a swap file.
That's because swapout to a swapfile goes through the filesystem's writepage,
which risks kernel stack overflow.

Do I misunderstand Kosaki's point?


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 11:24             ` Minchan Kim
@ 2010-07-06 15:25               ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-06 15:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> Hi, Mel.
> 
> On Tue, Jul 6, 2010 at 7:12 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> >> Hello,
> >>
> >> > Ok, that's reasonable as I'm still working on that patch. For example, the
> >> > patch disabled anonymous page writeback which is unnecessary as the stack
> >> > usage for anon writeback is less than file writeback.
> >>
> >> How do we examine swap-on-file?
> >>
> >
> > Anything in particular wrong with the following?
> >
> > /*
> >  * For now, only kswapd can writeback filesystem pages as otherwise
> >  * there is a stack overflow risk
> >  */
> > static inline bool reclaim_can_writeback(struct scan_control *sc,
> >                                        struct page *page)
> > {
> >        return !page_is_file_cache(page) || current_is_kswapd();
> > }
> >
> > Even if it is a swapfile, I didn't spot a case where the filesystems
> > writepage would be called. Did I miss something?
> 
> 
> As I understand Kosaki's opinion, He said that if we make swapout in
> pageout, it isn't a problem in case of swap device since swapout of
> block device is light

Sure

> but it is still problem in case of swap file.
> That's because swapout on swapfile cause file system writepage which
> makes kernel stack overflow.
> 

I don't *think* this is a problem unless I missed where writing out to
swap enters the filesystem code. I'll double check.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 15:25               ` Mel Gorman
@ 2010-07-06 20:27                 ` Johannes Weiner
  -1 siblings, 0 replies; 105+ messages in thread
From: Johannes Weiner @ 2010-07-06 20:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> > but it is still problem in case of swap file.
> > That's because swapout on swapfile cause file system writepage which
> > makes kernel stack overflow.
> 
> I don't *think* this is a problem unless I missed where writing out to
> swap enters teh filesystem code. I'll double check.

It bypasses the fs.  On swapon, the blocks are resolved
(mm/swapfile.c::setup_swap_extents) and then the writeout path uses
bios directly (mm/page_io.c::swap_writepage).

(GFP_NOFS still includes __GFP_IO, so allows swapping)
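
A minimal sketch of that path, condensed from mm/page_io.c::swap_writepage()
of this era (error handling and the sync-mode bio flags elided):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
        struct bio *bio;

        if (try_to_free_swap(page)) {
                unlock_page(page);
                return 0;
        }

        /*
         * The bio's target sector comes from the swap extents resolved at
         * swapon time, so no filesystem ->writepage is involved here.
         */
        bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
        if (bio == NULL) {
                set_page_dirty(page);
                unlock_page(page);
                return -ENOMEM;
        }

        count_vm_event(PSWPOUT);
        set_page_writeback(page);
        unlock_page(page);
        submit_bio(WRITE, bio);
        return 0;
}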

	Hannes

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-06 22:28                   ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06 22:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
>> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
>> > but it is still problem in case of swap file.
>> > That's because swapout on swapfile cause file system writepage which
>> > makes kernel stack overflow.
>>
>> I don't *think* this is a problem unless I missed where writing out to
>> swap enters teh filesystem code. I'll double check.
>
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
>
> (GFP_NOFS still includes __GFP_IO, so allows swapping)
>
>        Hannes

Thanks, Hannes. You're right.
The extents are resolved by setup_swap_extents.
Sorry for the confusion, Mel.

It was just my guess at what Kosaki meant; he may have had a different point in mind.
Ignore me.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 22:28                   ` Minchan Kim
@ 2010-07-07  0:24                     ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  0:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 07, 2010 at 07:28:14AM +0900, Minchan Kim wrote:
> On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> >> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> >> > but it is still problem in case of swap file.
> >> > That's because swapout on swapfile cause file system writepage which
> >> > makes kernel stack overflow.
> >>
> >> I don't *think* this is a problem unless I missed where writing out to
> >> swap enters teh filesystem code. I'll double check.
> >
> > It bypasses the fs.  On swapon, the blocks are resolved
> > (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> > bios directly (mm/page_io.c::swap_writepage).
> >
> > (GFP_NOFS still includes __GFP_IO, so allows swapping)
> >
> >        Hannes
> 
> Thanks, Hannes. You're right.
> Extents would be resolved by setup_swap_extents.
> Sorry for confusing, Mel.
> 

No confusion. I was 99.99999% certain this was the case and had tested with
a few BUG_ON()s just in case, but confirmation is helpful. Thanks, both.

What I have now is direct writeback for anon pages. For file pages, be it from
kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
on the dirty pages encountered, because monitoring from systemtap indicated
that we were getting a large percentage of the dirty file pages at the end
of the LRU lists (bad). Initial tests show that page reclaim writeback from
kswapd is reduced by 97% with this sort of pre-emptive kicking of the flusher
threads, based on these figures from sysbench.

                traceonly-v4r1  stackreduce-v4r1    flushforward-v4r4
Direct reclaims                                621        710         30928 
Direct reclaim pages scanned                141316     141184       1912093 
Direct reclaim write file async I/O          23904      28714             0 
Direct reclaim write anon async I/O            716        918            88 
Direct reclaim write file sync I/O               0          0             0 
Direct reclaim write anon sync I/O               0          0             0 
Wake kswapd requests                        713250     735588       5626413 
Kswapd wakeups                                1805       1498           641 
Kswapd pages scanned                      17065538   15605327       9524623 
Kswapd reclaim write file async I/O         715768     617225         23938  <-- Wooo
Kswapd reclaim write anon async I/O         218003     214051        198746 
Kswapd reclaim write file sync I/O               0          0             0 
Kswapd reclaim write anon sync I/O               0          0             0 
Time stalled direct reclaim (ms)              9.87      11.63        315.30 
Time kswapd awake (ms)                     1884.91    2088.23       3542.92 

This is "good" IMO because file IO from page reclaim is frowned upon because
of poor IO patterns. There isn't a launder process I can kick for anon pages
to get overall reclaim IO down but it's not clear it's worth it at this
juncture because AFAIK, IO to swap blows anyway. The biggest plus is that
direct reclaim still does not call into the filesystem with my current series, so
stack overflows are less of a heartache. As the number of pages encountered
for filesystem writeback is reduced, it's also less of a problem for memcg.

The direct reclaim stall latency increases because of congestion_wait
throttling, but the overall test completes 602 seconds faster, or by 8% (figures
not included). Scanning rates go up, but with the reduced time to completion,
on balance I think it works out.

Andrew has picked up some of the series, but I have another modification
to the tracepoints to differentiate between anon and file IO, which I now
think is a very important distinction as the flushers work on one but not the
other. I also must rebase upon an mmotm based on 2.6.35-rc4 before re-posting
the series, but broadly speaking, I think we are going in the right direction
without needing stack-switching tricks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-07  1:14                   ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07  1:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Minchan Kim, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 10:27:58PM +0200, Johannes Weiner wrote:
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
> 
> (GFP_NOFS still includes __GFP_IO, so allows swapping)

Exactly.  Note that while the stack problems for swap writeout aren't
as bad as for filesystems, since the whole allocator / extent map footprint
is missing, it might still be an issue.  We still splice the whole block
I/O stack footprint onto a random stack that might already be filled up a lot.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  0:24                     ` Mel Gorman
@ 2010-07-07  1:15                       ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07  1:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 07, 2010 at 01:24:58AM +0100, Mel Gorman wrote:
> What I have now is direct writeback for anon files. For files be it from
> kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
> on the dirty pages encountered because monitoring from systemtap indicated
> that we were getting a large percentage of the dirty file pages at the end
> of the LRU lists (bad). Initial tests show that page reclaim writeback is
> reduced from kswapd by 97% with this sort of pre-emptive kicking of flusher
> threads based on these figures from sysbench.

That sounds like yet another band-aid to me.  Instead it would be much
better to not have so many file pages at the end of the LRU by tuning the
flusher threads and the VM better.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-05 13:49       ` Mel Gorman
@ 2010-07-07  5:03         ` Wu Fengguang
  -1 siblings, 0 replies; 105+ messages in thread
From: Wu Fengguang @ 2010-07-07  5:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

Hi Mel,

> Second, using systemtap, I was able to see that file-backed dirty
> pages have a tendency to be near the end of the LRU even though they
> are a small percentage of the overall pages in the LRU. I'm hoping
> to figure out why this is as it would make avoiding writeback a lot
> less controversial.

Your intuitions are correct -- the current background writeback logic
fails to write older inodes first. Under heavy load the background
writeback job may run forever, totally ignoring the time order of
inode->dirtied_when. This is probably why you see lots of dirty pages
near the end of the LRU.

Here is an old patch for fixing this. Sorry for being late. I'll
pick up and refresh the patch series ASAP.  (I made the mistake last
year of posting too many patches at once. I'll break them up into
more manageable pieces.)

[PATCH 31/45] writeback: sync old inodes first in background writeback
<https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476313>
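
The gist of it, as an illustrative fragment only (not the actual patch; the
'expire' cutoff plays the same role as the kupdate dirty_expire_interval):

        /*
         * In background writeback, move inodes dirtied before the cutoff
         * from b_dirty to b_io first, so old inodes are written before
         * newly dirtied ones.
         */
        list_for_each_entry_safe(inode, tmp, &wb->b_dirty, i_list) {
                if (time_before(inode->dirtied_when, expire))
                        list_move_tail(&inode->i_list, &wb->b_io);
        }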

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  1:15                       ` Christoph Hellwig
@ 2010-07-07  9:43                         ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  9:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Minchan Kim, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli

On Tue, Jul 06, 2010 at 09:15:33PM -0400, Christoph Hellwig wrote:
> On Wed, Jul 07, 2010 at 01:24:58AM +0100, Mel Gorman wrote:
> > What I have now is direct writeback for anon files. For files be it from
> > kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
> > on the dirty pages encountered because monitoring from systemtap indicated
> > that we were getting a large percentage of the dirty file pages at the end
> > of the LRU lists (bad). Initial tests show that page reclaim writeback is
> > reduced from kswapd by 97% with this sort of pre-emptive kicking of flusher
> > threads based on these figures from sysbench.
> 
> That sounds like yet another bad aid to me.  Instead it would be much
> better to not have so many file pages at the end of LRU by tuning the
> flusher threads and VM better.
> 

Do you mean "so many dirty file pages"? I'm going to assume you do.

How do you suggest tuning this? The modification I tried was "if N dirty
pages are found during a SWAP_CLUSTER_MAX scan of pages, assume an average
dirtying density of at least that during the time those pages were inserted on
the LRU. In response, ask the flushers to flush 1.5X". This roughly responds
to the conditions it finds as they are encountered and is based on scanning
rates instead of time. It seemed like a reasonable option.
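
In code it amounts to little more than the following (illustrative fragment,
not the exact patch; nr_dirty is the count of dirty pages the scan encountered):

        /*
         * Ask the flusher threads to clean roughly 1.5x the dirty pages
         * just encountered so the next pass over this part of the LRU
         * finds them clean.
         */
        if (nr_dirty)
                wakeup_flusher_threads(nr_dirty + nr_dirty / 2);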

Based on what I've seen, we are generally below the dirty_ratio and the
flushers are behaving as expected, so there is little tuning available there.
As new dirty pages are added to the inactive list, they are allowed to reach
the bottom of the LRU before the periodic sync kicks in. From what I can tell,
the flusher threads are already cleaning the oldest inodes first, and I'd
expect a rough correlation between the oldest inodes and the oldest pages.

We could reduce the dirty_ratio, but people already complain about workloads
that are not allowed to dirty enough pages. We could decrease the sync
interval for the flusher threads, but then IO might start sooner than it
should, and it might be unnecessary if the system is under no memory pressure.

Alternatives?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  5:03         ` Wu Fengguang
@ 2010-07-07  9:50           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  9:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

On Wed, Jul 07, 2010 at 01:03:38PM +0800, Wu Fengguang wrote:
> Hi Mel,
> 
> > Second, using systemtap, I was able to see that file-backed dirty
> > pages have a tendency to be near the end of the LRU even though they
> > are a small percentage of the overall pages in the LRU. I'm hoping
> > to figure out why this is as it would make avoiding writeback a lot
> > less controversial.
> 
> Your intuition is correct -- the current background writeback logic
> fails to write older inodes first. Under heavy load, the background
> writeback job may run forever, totally ignoring the time order of
> inode->dirtied_when. This is probably why you see lots of dirty pages
> near the end of the LRU.
> 

Possible. In a mail to Christoph, I asserted that writeback of older inodes
was happening first, but I could obviously be mistaken.

> Here is an old patch that fixes this. Sorry for being late. I'll
> pick up and refresh the patch series ASAP. (I made the mistake last
> year of posting too many patches at one time; I'll break them up into
> more manageable pieces.)
> 
> [PATCH 31/45] writeback: sync old inodes first in background writeback
> <https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476313>
> 

I'll check it out as an alternative to forward-flushing based on the
number of dirty pages encountered during scanning. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  9:43                         ` Mel Gorman
@ 2010-07-07 12:51                           ` Rik van Riel
  -1 siblings, 0 replies; 105+ messages in thread
From: Rik van Riel @ 2010-07-07 12:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Minchan Kim, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On 07/07/2010 05:43 AM, Mel Gorman wrote:

> How do you suggest tuning this? The modification I tried was "if N dirty
> pages are found during a SWAP_CLUSTER_MAX scan of pages, assume an average
> dirtying density of at least that during the time those pages were inserted on
> the LRU. In response, ask the flushers to flush 1.5X". This roughly responds
> to the conditions it finds as they are encountered and is based on scanning
> rates instead of time. It seemed like a reasonable option.

Your idea sounds like something we need to have, regardless
of whether or not we fix the flusher to flush older inodes
first (we probably should do that, too).

I believe this for the simple reason that we could have too
many dirty pages in one memory zone, while the flusher's
dirty threshold is system wide.

If we both fix the flusher to flush old inodes first and
kick the flusher from the reclaim code, we should be
golden.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  5:03         ` Wu Fengguang
@ 2010-07-07 18:09           ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07 18:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

On Wed, Jul 07, 2010 at 01:03:38PM +0800, Wu Fengguang wrote:
> Here is an old patch that fixes this. Sorry for being late. I'll
> pick up and refresh the patch series ASAP. (I made the mistake last
> year of posting too many patches at one time; I'll break them up into
> more manageable pieces.)

Yes, that would be very welcome.  There's a lot of important work
in that series.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-08  6:39                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  6:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, Mel Gorman, Minchan Kim, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

> On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> > On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> > > but it is still a problem in the case of a swap file.
> > > That's because swapout on a swapfile causes a filesystem writepage, which
> > > makes the kernel stack overflow.
> >
> > I don't *think* this is a problem unless I missed where writing out to
> > swap enters the filesystem code. I'll double check.
> 
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).

Yeah, my fault. I misunderstood this.

Thank you.



> 
> (GFP_NOFS still includes __GFP_IO, so allows swapping)
> 
> 	Hannes
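
For reference, the path described above looks roughly like this in
this era's mm/page_io.c (a heavily simplified sketch: try_to_free_swap(),
sync flags, accounting and the bio completion handler are omitted, so
treat the details as approximate):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	struct bio *bio;

	/*
	 * Build a bio against the swap device blocks resolved at
	 * swapon time by setup_swap_extents(); no filesystem
	 * ->writepage is entered.
	 */
	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
	if (!bio) {
		set_page_dirty(page);
		unlock_page(page);
		return -ENOMEM;
	}
	set_page_writeback(page);
	unlock_page(page);
	submit_bio(WRITE, bio);
	return 0;
}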




^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2010-07-08  6:39 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-29 11:34 [PATCH 0/14] Avoid overflowing of stack during page reclaim V3 Mel Gorman
2010-06-29 11:34 ` Mel Gorman
2010-06-29 11:34 ` [PATCH 01/14] vmscan: Fix mapping use after free Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 14:27   ` Minchan Kim
2010-06-29 14:27     ` Minchan Kim
2010-07-01  9:53     ` Mel Gorman
2010-07-01  9:53       ` Mel Gorman
2010-06-29 14:44   ` Johannes Weiner
2010-06-29 14:44     ` Johannes Weiner
2010-06-29 11:34 ` [PATCH 02/14] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 03/14] tracing, vmscan: Add trace events for LRU page isolation Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 04/14] tracing, vmscan: Add trace event when a page is written Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 06/14] vmscan: kill prev_priority completely Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 07/14] vmscan: simplify shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 08/14] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 09/14] vmscan: Setup pagevec as late as possible in shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 10/14] vmscan: Setup pagevec as late as possible in shrink_page_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 11/14] vmscan: Update isolated page counters outside of main path in shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-07-02 19:51   ` Andrew Morton
2010-07-02 19:51     ` Andrew Morton
2010-07-05 13:49     ` Mel Gorman
2010-07-05 13:49       ` Mel Gorman
2010-07-06  0:36       ` KOSAKI Motohiro
2010-07-06  0:36         ` KOSAKI Motohiro
2010-07-06  5:46         ` Minchan Kim
2010-07-06  5:46           ` Minchan Kim
2010-07-06  6:02           ` KOSAKI Motohiro
2010-07-06  6:02             ` KOSAKI Motohiro
2010-07-06  6:38             ` Minchan Kim
2010-07-06  6:38               ` Minchan Kim
2010-07-06 10:12         ` Mel Gorman
2010-07-06 10:12           ` Mel Gorman
2010-07-06 11:13           ` KOSAKI Motohiro
2010-07-06 11:13             ` KOSAKI Motohiro
2010-07-06 11:24           ` Minchan Kim
2010-07-06 11:24             ` Minchan Kim
2010-07-06 15:25             ` Mel Gorman
2010-07-06 15:25               ` Mel Gorman
2010-07-06 15:25               ` Mel Gorman
2010-07-06 20:27               ` Johannes Weiner
2010-07-06 20:27                 ` Johannes Weiner
2010-07-06 22:28                 ` Minchan Kim
2010-07-06 22:28                   ` Minchan Kim
2010-07-07  0:24                   ` Mel Gorman
2010-07-07  0:24                     ` Mel Gorman
2010-07-07  0:24                     ` Mel Gorman
2010-07-07  1:15                     ` Christoph Hellwig
2010-07-07  1:15                       ` Christoph Hellwig
2010-07-07  9:43                       ` Mel Gorman
2010-07-07  9:43                         ` Mel Gorman
2010-07-07 12:51                         ` Rik van Riel
2010-07-07 12:51                           ` Rik van Riel
2010-07-07  1:14                 ` Christoph Hellwig
2010-07-07  1:14                   ` Christoph Hellwig
2010-07-08  6:39                 ` KOSAKI Motohiro
2010-07-08  6:39                   ` KOSAKI Motohiro
2010-07-07  5:03       ` Wu Fengguang
2010-07-07  5:03         ` Wu Fengguang
2010-07-07  9:50         ` Mel Gorman
2010-07-07  9:50           ` Mel Gorman
2010-07-07 18:09         ` Christoph Hellwig
2010-07-07 18:09           ` Christoph Hellwig
2010-06-29 11:34 ` [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-30 13:05   ` Chris Mason
2010-06-30 13:05     ` Chris Mason
2010-07-01  9:55     ` Mel Gorman
2010-07-01  9:55       ` Mel Gorman
2010-07-01  9:55       ` Mel Gorman
2010-06-29 11:34 ` [PATCH 14/14] fs,xfs: " Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 12:37   ` Christoph Hellwig
2010-06-29 12:37     ` Christoph Hellwig
2010-06-29 12:51     ` Mel Gorman
2010-06-29 12:51       ` Mel Gorman
2010-06-30  0:14       ` KAMEZAWA Hiroyuki
2010-06-30  0:14         ` KAMEZAWA Hiroyuki
2010-07-01 10:30         ` Mel Gorman
2010-07-01 10:30           ` Mel Gorman
2010-07-02  6:26           ` KAMEZAWA Hiroyuki
2010-07-02  6:26             ` KAMEZAWA Hiroyuki
2010-07-02  6:31             ` KAMEZAWA Hiroyuki
2010-07-02  6:31               ` KAMEZAWA Hiroyuki
2010-07-05 14:16             ` Mel Gorman
2010-07-05 14:16               ` Mel Gorman
2010-07-06  0:45               ` KAMEZAWA Hiroyuki
2010-07-06  0:45                 ` KAMEZAWA Hiroyuki
2010-07-02 19:33 ` [PATCH 0/14] Avoid overflowing of stack during page reclaim V3 Andrew Morton
2010-07-02 19:33   ` Andrew Morton
2010-07-05  1:35   ` KAMEZAWA Hiroyuki
2010-07-05  1:35     ` KAMEZAWA Hiroyuki
