* [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
@ 2010-06-29 11:34 ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

Here is V3, which again depends on the flusher threads to do writeback for
direct reclaim rather than switching stacks, which is not something I'm
likely to get done before xfs/btrfs are ignoring writeback in mainline
(the PhD is sucking up time). Instead, direct reclaimers that encounter
dirty pages call congestion_wait and, in the case of lumpy reclaim, wait
on the specific pages. A memory pressure test did not show up the
premature OOM problems that some had concerns about.

The details below are long but the short summary is that, on balance, this
patchset appears to behave better than the vanilla kernel: fewer pages are
written back by the VM, high-order allocation under stress performs quite
well and xfs and btrfs both obey writepage again. ext4 still largely
ignores writepage from reclaim context but it would take more significant
legwork to fix that.

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead,
    wake up the flusher threads to clean the number of dirty pages that were
    encountered
 
Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem with the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series aims to trace
writebacks so that it can be evaluated how many dirty pages are being
written, to reduce the stack usage of page reclaim in general and to avoid
direct reclaim writing back pages and overflowing the stack.

The first patch is a fix from Nick Piggin to use lock_page_nosync instead
of lock_page when handling a write error, as the mapping may have vanished
after the page was unlocked.

The next four patches are a forward-port of tracepoints that are partly
based on tracepoints defined by Larry Woodman but never merged. They trace
parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
tracepoints can be used to evaluate what is happening within reclaim and
whether things are getting better or worse. They do not have to be part of
the final series but might be useful during discussion and for later
regression testing - particularly around the percentage of time spent in
reclaim.
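
For anyone who wants to poke at the tracepoints without the post-processing
script, something along these lines (not part of the series; it assumes
debugfs is mounted at /sys/kernel/debug) is enough to enable the vmscan
events around a workload and capture the buffer for later analysis:

#include <stdio.h>
#include <stdlib.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"

/* write a small string to a tracing control file */
static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(int argc, char **argv)
{
	write_file(TRACE_DIR "/trace", "\n");			/* clear the buffer */
	write_file(TRACE_DIR "/events/vmscan/enable", "1");	/* all mm_vmscan_* events */

	if (argc > 1)
		system(argv[1]);	/* the workload to trace, passed as one argument */

	write_file(TRACE_DIR "/events/vmscan/enable", "0");
	system("cat " TRACE_DIR "/trace > vmscan-trace.log");
	/* vmscan-trace.log can then be fed to trace-vmscan-postprocess.pl from patch 4 */
	return 0;
}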

The six patches after that reduce the stack footprint of page reclaim by
moving large allocations out of the main call path. Functionally they
should be equivalent, although there is a timing change in exactly when
pages get freed. This is aimed at giving filesystems as much stack as
possible if kswapd is to write back pages directly.

Patch 12 prevents direct reclaim from writing out pages at all. Instead,
the flusher threads are asked to clean the number of pages encountered and
the caller waits on congestion before putting the pages back on the LRU.
For lumpy reclaim, the caller waits for a time, calling the flusher
multiple times and waiting on dirty pages to be written out, before trying
to reclaim the dirty pages a second time. This increases the responsibility
of kswapd somewhat because it is now cleaning pages on behalf of direct
reclaimers, but kswapd seemed a better fit than the background flushers to
clean pages as it knows where the pages needing cleaning are. As it is
async IO, it should not cause kswapd to stall (at least until the queue is
congested), but the order in which pages are reclaimed from the LRU is
altered: dirty pages that would have been reclaimed by direct reclaimers
get another lap on the LRU. The dirty pages could have been put on a
dedicated list, but this increased counter overhead and the number of
lists, and it is unclear whether it is necessary.

The final two patches revert changes to XFS and btrfs that ignore writeback
from reclaim context.

I ran a number of tests with monitoring on X86, X86-64 and PPC64 and I'll
cover the X86-64 results here. The machine is an AMD Phenom 4-core with 2G
of RAM, a single disk and the onboard IO controller. The dirty ratio was
left at 20 but tests with 40 did not show up any surprises. All of the
tests were run on XFS.

Three kernels are compared.

traceonly-v3r1		is the first 4 patches of this series
stackreduce-v3r1	is the first 12 patches of this series
nodirect-v3r9		is all patches in the series

The results of each test are broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace post-processing script in patch 4 and covers direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and what percentage of time kswapd was awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and the percentage is the stall time over the total time.
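
As a worked example with made-up numbers (the real inputs are the User+Sys
figure from /usr/bin/time and the stall figure from the script):

#include <stdio.h>

int main(void)
{
	double user_sys_sec = 2800.0;	/* User + Sys from /usr/bin/time (hypothetical) */
	double stall_ms = 8000.0;	/* "Time stalled direct reclaim" (hypothetical) */
	double stall_sec = stall_ms / 1000.0;
	double total_sec = user_sys_sec + stall_sec;

	/* prints 0.28% for these hypothetical inputs */
	printf("Percentage Time Spent Direct Reclaim %6.2f%%\n",
	       100.0 * stall_sec / total_sec);
	return 0;
}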

kernbench
=========

                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Elapsed min       98.00 ( 0.00%)    98.26 (-0.27%)    98.11 (-0.11%)
Elapsed mean      98.19 ( 0.00%)    98.35 (-0.16%)    98.22 (-0.03%)
Elapsed stddev     0.16 ( 0.00%)     0.06 (62.07%)     0.11 (32.71%)
Elapsed max       98.44 ( 0.00%)    98.42 ( 0.02%)    98.39 ( 0.05%)
User    min      310.43 ( 0.00%)   311.14 (-0.23%)   311.76 (-0.43%)
User    mean     311.34 ( 0.00%)   311.48 (-0.04%)   312.03 (-0.22%)
User    stddev     0.76 ( 0.00%)     0.34 (55.52%)     0.20 (73.54%)
User    max      312.51 ( 0.00%)   311.86 ( 0.21%)   312.30 ( 0.07%)
System  min       39.62 ( 0.00%)    40.76 (-2.88%)    40.03 (-1.03%)
System  mean      40.44 ( 0.00%)    40.92 (-1.20%)    40.18 ( 0.64%)
System  stddev     0.53 ( 0.00%)     0.20 (62.94%)     0.09 (82.81%)
System  max       41.11 ( 0.00%)    41.25 (-0.34%)    40.28 ( 2.02%)
CPU     min      357.00 ( 0.00%)   358.00 (-0.28%)   358.00 (-0.28%)
CPU     mean     357.75 ( 0.00%)   358.00 (-0.07%)   358.00 (-0.07%)
CPU     stddev     0.43 ( 0.00%)     0.00 (100.00%)     0.00 (100.00%)
CPU     max      358.00 ( 0.00%)   358.00 ( 0.00%)   358.00 ( 0.00%)
FTrace Reclaim Statistics
                                    traceonly-v3r1  stackreduce-v3r1   nodirect-v3r9
Direct reclaims                                  0          0          0 
Direct reclaim pages scanned                     0          0          0 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                             0          0          0 
Kswapd wakeups                                   0          0          0 
Kswapd pages scanned                             0          0          0 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.00       0.00       0.00 
Time kswapd awake (ms)                        0.00       0.00       0.00 

User/Sys Time Running Test (seconds)       2142.97   2145.36   2145.24
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                787.41    787.54    791.01
Percentage Time kswapd Awake                 0.00%     0.00%     0.00%

kernbench is a straightforward kernel compile. The kernel is built 5 times
and an average taken. There was no interesting difference in terms of
performance. As the workload fit easily in memory, there was no page
reclaim activity.

SysBench
========
FTrace Reclaim Statistics
                             traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                                595        948       2113 
Direct reclaim pages scanned                223195     236628     145547 
Direct reclaim write async I/O               39437      26136          0 
Direct reclaim write sync I/O                    0         29          0 
Wake kswapd requests                        703495     948725    1747346 
Kswapd wakeups                                1586       1631       1303 
Kswapd pages scanned                      15216731   16883788   15396343 
Kswapd reclaim write async I/O              877359     961730     235228 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             11.97      25.39      41.78 
Time kswapd awake (ms)                     1702.04    2178.75    2719.17 

User/Sys Time Running Test (seconds)        652.11    684.77    678.44
Percentage Time Spent Direct Reclaim         0.01%     0.00%     0.00%
Total Elapsed Time (seconds)               6033.29   6665.51   6715.48
Percentage Time kswapd Awake                 0.05%     0.00%     0.00%

The results were based on a read/write test and, as the machine is
under-provisioned for this type of test, the figures were very unstable,
with variances up to 15%, so they are not reported. Part of the problem is
that larger thread counts push the test into swap, as the memory is
insufficient, which destabilises the results further. I could tune for
this, but it was reclaim behaviour that was important.

The writing back of dirty pages was a factor. In previous tests, around 2%
of the pages scanned were dirty. In this test, the percentage was higher at
17% but, interestingly, the number of dirty pages encountered was roughly
the same as in previous tests; what changed is the number of pages scanned.
I have no useful theory on this yet other than to note that timing is a
very significant factor when analysing the ratio of dirty to clean pages
encountered. Whether swapping occurred or not at any given time is also
likely a significant factor.

Between 5% and 10% of the pages scanned by kswapd needed to be written back
on these three kernels. As the disk was maxed out all of the time, I'm
having trouble deciding whether this is "too much IO" or not, but I'm
leaning towards "no". I'd expect that the flusher threads were also having
a hard time getting IO bandwidth.

Overall though, based on a number of tests, the performance with and
without the patches is roughly the same, with the main difference being
that direct reclaim is not writing back pages.

Simple Writeback Test
=====================
FTrace Reclaim Statistics
                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                               2301       2382       2682 
Direct reclaim pages scanned                294528     251080     283305 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       1688511    1561619    1718567 
Kswapd wakeups                                1165       1096       1103 
Kswapd pages scanned                      11125671   10993353   11029352 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.01       0.01       0.01 
Time kswapd awake (ms)                      305.16     302.31     293.07 

User/Sys Time Running Test (seconds)        103.92    104.48    104.45
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                636.08    638.11    634.76
Percentage Time kswapd Awake                 0.05%     0.00%     0.00%

This test starts with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes files totalling 4 times RAM to
disk using dd to dirty memory, with conv=fsync to get some sort of
stability in the results.
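
The harness itself is not included in this posting, but each iteration
boils down to something like the sketch below; the RAM size and target
directory are assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const long long ram_mb = 2048;		/* 2G, as on the test machine */
	int threads, i;

	for (threads = 4; threads <= 64; threads *= 2) {
		long long per_file_mb = (4 * ram_mb) / threads;

		/* fork one dd writer per thread so the iteration writes ~4x RAM */
		for (i = 0; i < threads; i++) {
			if (fork() == 0) {
				char cmd[256];

				snprintf(cmd, sizeof(cmd),
					 "dd if=/dev/zero of=/mnt/test/dd-%d-%d "
					 "bs=1M count=%lld conv=fsync",
					 threads, i, per_file_mb);
				_exit(system(cmd));
			}
		}
		while (wait(NULL) > 0)
			;			/* wait for this iteration's writers */
	}
	return 0;
}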

Direct reclaim writeback was not a problem for this test even though a
number of pages were scanned, so there is no reason not to disable it.

Stress HighAlloc
================
                traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Pass 1          79.00 ( 0.00%)    77.00 (-2.00%)    74.00 (-5.00%)
Pass 2          80.00 ( 0.00%)    80.00 ( 0.00%)    76.00 (-4.00%)
At Rest         81.00 ( 0.00%)    82.00 ( 1.00%)    83.00 ( 2.00%)

FTrace Reclaim Statistics
                                    traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                               1209       1214       1261 
Direct reclaim pages scanned                177575     141702     163362 
Direct reclaim write async I/O               36877      27783          0 
Direct reclaim write sync I/O                17822      10834          0 
Wake kswapd requests                          4656       1178       4195 
Kswapd wakeups                                 861        854        904 
Kswapd pages scanned                      50583598   33013186   22497685 
Kswapd reclaim write async I/O             3880667    3676997    1363530 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)           7980.65    7836.76    6008.81 
Time kswapd awake (ms)                     5038.21    5329.90    2912.61 

User/Sys Time Running Test (seconds)       2818.55   2827.11   2821.88
Percentage Time Spent Direct Reclaim         0.21%     0.00%     0.00%
Total Elapsed Time (seconds)              10304.58  10151.72   8471.06
Percentage Time kswapd Awake                 0.03%     0.00%     0.00%

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated as huge
pages under load (Pass 1), on a second attempt under load (Pass 2) and when
the kernel compiles have finished and the system is quiet (At Rest).
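
For reference, the huge page side of that metric can be measured with
something as simple as the sketch below; the real harness may differ, and
the 2M huge page size on X86-64 is an assumption:

#include <stdio.h>

int main(void)
{
	const long ram_mb = 2048, hugepage_mb = 2;
	long requested = ram_mb / hugepage_mb, got = 0;
	FILE *f;

	/* ask for enough huge pages to cover RAM ... */
	f = fopen("/proc/sys/vm/nr_hugepages", "w");
	if (f) {
		fprintf(f, "%ld\n", requested);
		fclose(f);
	}

	/* ... and see how many the kernel actually managed to allocate */
	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (f && fscanf(f, "%ld", &got) == 1)
		printf("Allocated %ld%% of RAM as huge pages\n",
		       (100 * got) / requested);
	if (f)
		fclose(f);
	return 0;
}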

The success figures were comparable with or without direct reclaim. Unlike
V2, the success rates on PPC64 actually improved, but they are not reported
here.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim, and it increases the stall time of a lumpy
reclaimer by quite a margin. Compare the "Time stalled direct reclaim"
between the vanilla and nodirect kernels - the nodirect kernel is stalled
for less time, but not dramatically less, as direct reclaimers still stall
when dirty pages are encountered. Interestingly, the time kswapd is stalled
is significantly reduced and the test completes faster.

Memory Pressure leading to OOM
==============================

There were concerns that direct reclaim not writing back pages under heavy
memory pressure could lead to premature OOMs. To test this theory, a test
was run as follows:

1. Remove existing swap, create a swapfile on XFS and activate it
2. Start 1 kernel compile on XFS per CPU in the system, wait until they start
3. Start 2xNR_CPUs processes writing zero-filled files to XFS. The total size of the
   files was the amount of physical memory on the system
4. Start 2xNR_CPUs processes that map anonymous memory and continually dirty it. The
   total size of the mappings was the amount of physical memory in the system

This stress test was allowed to run for a period of time during which load, swap and IO
activity were all high. No premature OOMs were found.
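
The test processes are not included in this posting but, for illustration,
the anonymous-memory dirtier in step 4 amounts to something like this (the
per-process mapping size is an assumption):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	/* e.g. RAM / (2 * NR_CPUS) per process so the total is about RAM */
	size_t len = 256UL << 20;
	size_t off;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	for (;;)			/* continually redirty every page */
		for (off = 0; off < len; off += 4096)
			buf[off] = (char)off;
}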

FTrace Reclaim Statistics
                             traceonly-v3r1  stackreduce-v3r1     nodirect-v3r9
Direct reclaims                              14006       6421      17068 
Direct reclaim pages scanned               1751731     795431    1252629 
Direct reclaim write async I/O               58918      51257          0 
Direct reclaim write sync I/O                   38         27          0 
Wake kswapd requests                       1172938     632220    1082273 
Kswapd wakeups                                 313        260        299 
Kswapd pages scanned                       3824966    3273876    3754542 
Kswapd reclaim write async I/O              860177     460838     565028 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)            377.09     267.92     180.24 
Time kswapd awake (ms)                      295.44     264.65     243.42 

User/Sys Time Running Test (seconds)       8640.01   8668.76   8693.93
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               4185.68   4184.70   4088.38
Percentage Time kswapd Awake                 0.01%     0.00%     0.00%

Interestingly, lumpy reclaim writing pages synchronously was a factor in
this test, which I didn't expect - probably due to stack allocations for
new processes. Dirty pages are still being encountered by kswapd but, as a
percentage of scanning, this is reduced by the patches, as are the amount
of time stalled in direct reclaim and the time kswapd is awake. The main
thing is that OOMs were not unexpectedly triggered.

Overall, this series appears to improve things from an IO perspective - at
least in terms of the amount of IO being generated by the VM and the time
spent handling it. I do have some concerns about the variability in the
ratio of dirty to scanned pages encountered but, with the tracepoints, it
is something that can be investigated further. Even if we get that ratio
down, it is still the case that direct reclaim should not write pages, to
avoid a stack overflow. If writing back pages is found to be a requirement
for whatever reason, nothing in this series prevents a future patch doing
direct writeback on a separate stack, but with this approach more IO is
punted to the flusher threads, which should be desirable to the FS folk.

Comments?

 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 fs/btrfs/disk-io.c                                 |   21 +-
 fs/xfs/linux-2.6/xfs_aops.c                        |   15 -
 include/linux/memcontrol.h                         |    5 -
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/memcontrol.c                                    |   31 -
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  560 ++++++++++--------
 mm/vmstat.c                                        |    2 -
 12 files changed, 1190 insertions(+), 374 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h



* [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: Nick Piggin <npiggin@suse.de>

Use lock_page_nosync in handle_write_error as after writepage we have no
reference to the mapping when taking the page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..62a30fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
-- 
1.7.1



* [PATCH 02/14] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds two trace events for kswapd waking up and going to sleep,
for the purposes of tracking kswapd activity, and two trace events for
direct reclaim beginning and ending. The information can be used to work
out how much time a process or the system is spending on the reclamation
of pages and, in the case of direct reclaim, how many pages were reclaimed
for that process. High-frequency triggering of these events could point to
memory pressure problems.
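
For illustration only (the Perl post-processing script added later in the
series does this properly), a consumer of the begin/end pair could total up
per-task time in direct reclaim along these lines. It assumes the default
ftrace text layout of "comm-pid  [cpu]  timestamp:  event:  args" and reads
the captured trace on stdin:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct task_stall {
	char id[64];		/* "comm-pid" from the trace line */
	double begin;		/* timestamp of an unmatched begin event, or -1 */
	double total;		/* accumulated seconds in direct reclaim */
};

static struct task_stall tasks[256];
static int nr_tasks;

static struct task_stall *lookup(const char *id)
{
	int i;

	for (i = 0; i < nr_tasks; i++)
		if (!strcmp(tasks[i].id, id))
			return &tasks[i];
	if (nr_tasks == 256)
		return NULL;
	strncpy(tasks[nr_tasks].id, id, sizeof(tasks[0].id) - 1);
	tasks[nr_tasks].begin = -1;
	return &tasks[nr_tasks++];
}

int main(void)
{
	char line[512], id[64], cpu[16], ts[32], event[64];
	int i;

	while (fgets(line, sizeof(line), stdin)) {
		struct task_stall *t;
		double now;

		if (line[0] == '#')
			continue;
		if (sscanf(line, "%63s %15s %31s %63s", id, cpu, ts, event) != 4)
			continue;
		t = lookup(id);
		if (!t)
			continue;
		now = strtod(ts, NULL);		/* the trailing ':' is ignored */
		if (!strcmp(event, "mm_vmscan_direct_reclaim_begin:"))
			t->begin = now;
		else if (!strcmp(event, "mm_vmscan_direct_reclaim_end:") && t->begin >= 0) {
			t->total += now - t->begin;
			t->begin = -1;
		}
	}
	for (i = 0; i < nr_tasks; i++)
		printf("%-20s %10.6f seconds in direct reclaim\n",
		       tasks[i].id, tasks[i].total);
	return 0;
}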

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/gfpflags.h |   37 +++++++++++++
 include/trace/events/kmem.h     |   38 +-------------
 include/trace/events/vmscan.h   |  115 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                     |   24 +++++++--
 4 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f76521f
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,115 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62a30fe..d425cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -1886,6 +1889,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1898,7 +1902,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2297,9 +2309,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2319,8 +2332,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
-- 
1.7.1


 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 03/14] tracing, vmscan: Add trace events for LRU page isolation
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds an event that fires when pages are isolated en masse from
the LRU lists. The event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.
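
As an illustration of how the event can be consumed, the sketch below (not
part of the patch itself) tallies the contig_failed counter per order from
the text trace, keying off the field names in the TP_printk format added
here. It assumes the raw trace text is fed on stdin, for example from
trace_pipe:

  #!/usr/bin/perl
  # Minimal sketch, not part of this patch: tally contig_failed per order
  # from mm_vmscan_lru_isolate lines read on stdin (e.g. from trace_pipe)
  use strict;

  my %failed_per_order;
  while (my $line = <STDIN>) {
  	next unless $line =~ /mm_vmscan_lru_isolate:/;
  	my ($order)  = $line =~ /order=([0-9]+)/;
  	my ($failed) = $line =~ /contig_failed=([0-9]+)/;
  	next unless defined $order && defined $failed;
  	$failed_per_order{$order} += $failed;
  }
  foreach my $order (sort { $a <=> $b } keys %failed_per_order) {
  	print "order=$order contig_failed=$failed_per_order{$order}\n";
  }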

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/vmscan.h |   46 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   14 ++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
 	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+		
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d425cef..095c66c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -994,12 +995,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order,
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 04/14] tracing, vmscan: Add trace event when a page is written
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a trace event for when page reclaim queues a page for IO,
recording whether the IO is synchronous or asynchronous. Excessive
synchronous IO for a process can result in noticeable stalls during direct
reclaim. Excessive IO from page reclaim may indicate that the system is
seriously under-provisioned for the number of dirty pages that exist.
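
As a quick illustration, and not part of the patch itself, the split
between synchronous and asynchronous reclaim writeback can be pulled
straight out of the text trace by matching the sync_io field in the print
format added here:

  #!/usr/bin/perl
  # Minimal sketch, not part of this patch: count how often reclaim queued
  # pages for sync vs async writeback, from mm_vmscan_writepage lines on stdin
  use strict;

  my ($sync, $async) = (0, 0);
  while (my $line = <STDIN>) {
  	next unless $line =~ /mm_vmscan_writepage:.*sync_io=([0-9]+)/;
  	$1 ? $sync++ : $async++;
  }
  print "reclaim writepage: sync=$sync async=$async\n";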

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 include/trace/events/vmscan.h |   23 +++++++++++++++++++++++
 mm/vmscan.c                   |    2 ++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 		__entry->nr_lumpy_dirty,
 		__entry->nr_lumpy_failed)
 );
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
 		
 #endif /* _TRACE_VMSCAN_H */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 095c66c..20160c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events. It can be used to give an indication of how much traffic
there is on the LRU lists and how severe the latencies due to reclaim are.
Example output looks like the following:

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
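
To collect input for the script, a hypothetical wrapper along the
following lines can be used; the debugfs paths and the --ignore-pid switch
are the ones the script itself assumes, so adjust to taste:

  #!/usr/bin/perl
  # Hypothetical helper, not part of this patch: enable the vmscan trace
  # events and stream the live trace into the post-processing script
  use strict;

  my $tracing = "/sys/kernel/debug/tracing";
  open(my $enable, '>', "$tracing/events/vmscan/enable")
  	or die "Cannot enable vmscan events (is debugfs mounted?): $!";
  print $enable "1\n";
  close($enable);

  # Aggregate processes of the same name and parse the live trace
  exec("./trace-vmscan-postprocess.pl --ignore-pid < $tracing/trace_pipe");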

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a proof-of-concept for reading the text representation of trace output
+# related to page reclaim. It attempts to extract some high-level information on
+# what is going on. The accuracy of the parser may vary.
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert a sec.usec timestamp to milliseconds
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate/o\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
@ 2010-06-29 11:34   ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events.  It can be used to give an indication of how much traffic
there is on the LRU lists and how severe latencies due to reclaim are.
Example output looks like the following

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
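
Each latency entry above reads as order-latency_in_ms, so "9-200.179" is
an order-9 direct reclaim that stalled for roughly 200ms. The report is
built purely from the text ftrace emits on trace_pipe (plus
/proc/<pid>/stat when --read-procstat is used), with the vmscan
tracepoints enabled as the usage comment in the script shows. A minimal
sketch of the first parsing step, using the same field-extraction regex
as the script; the sample line is invented for illustration rather than
copied from a real trace:

	#!/usr/bin/perl
	# Hypothetical ftrace line: comm-pid [cpu] timestamp: tracepoint: details
	my $line = "kswapd0-311 [001] 280.999000: mm_vmscan_kswapd_wake: nid=0 order=9";
	if ($line =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/) {
		my ($comm_pid, $cpu, $timestamp, $tracepoint, $details) = ($1, $2, $3, $4, $5);
		print "$comm_pid $tracepoint $details\n";
	}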

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_isolate/o\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 06/14] vmscan: kill prev_priority completely
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Since 2.6.28, zone->prev_priority has been unused and can be removed
safely. Doing so also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more)
attempts have not produced good performance numbers, so I have given up
on that approach.

The rest of this changelog consists of notes on prev_priority: why it
existed in the first place and why it may no longer be necessary. This
information is based heavily on discussions between Andrew Morton, Rik
van Riel and KOSAKI Motohiro, who is quoted heavily below.

Historically, prev_priority was important because it determined when the
VM would start unmapping PTE pages, i.e. it fed into the balances of note
within the VM: Anon vs File and Mapped vs Unmapped. Without prev_priority,
there is a potential risk of unnecessarily increasing minor faults, as a
large amount of read activity on use-once pages could push mapped pages
to the end of the LRU where they get unmapped.

There is no proof that this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it were still a problem, prev_priority would not work nowadays.
First of all, current vmscan still contains a lot of UP-centric code,
which exposes weaknesses on machines with dozens of CPUs. I think we need
more and more improvement there.

The problem is that current vmscan mixes up per-system pressure,
per-zone pressure and per-task pressure a bit. For example, prev_priority
tries to boost the priority of other concurrent reclaimers, but if
another task has a mempolicy restriction this is unnecessary, and it also
causes wrongly large latencies and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal and 2) we need
per-zone pressure, not per-system pressure.

Another example: DEF_PRIORITY is currently 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the
OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim
at the same time, the system is under higher memory pressure than
priority==0 implies (1/4096 * 10,000 > 2). prev_priority cannot solve
such a multithreaded workload issue. In other words, the prev_priority
concept assumes the system does not have lots of threads.
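
As a quick sanity check of that arithmetic (a standalone sketch, not code
from this patch): one full priority sweep scans 1/4096 + 1/2048 + ... + 1
of the LRU, just under two full cycles, while 10,000 concurrent reclaimers
each scanning 1/4096 of the LRU at DEF_PRIORITY add up to more than that.

	#!/usr/bin/perl
	# LRU fraction scanned over one full priority sweep, DEF_PRIORITY == 12
	my $one_sweep = 0;
	$one_sweep += 1.0 / (1 << $_) for (0 .. 12);
	# 10,000 concurrent reclaimers, each scanning 1/4096 of the LRU
	my $concurrent = 10_000 / 4096;
	printf "one sweep ~ %.4f LRU cycles, 10000 threads at DEF_PRIORITY ~ %.4f\n",
		$one_sweep, $concurrent;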

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/memcontrol.h |    5 ----
 include/linux/mmzone.h     |   15 -----------
 mm/memcontrol.c            |   31 ------------------------
 mm/page_alloc.c            |    2 -
 mm/vmscan.c                |   57 --------------------------------------------
 mm/vmstat.c                |    2 -
 6 files changed, 0 insertions(+), 112 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ece0a..7557f66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20160c7..f3d95c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 06/14] vmscan: kill prev_priority completely
@ 2010-06-29 11:34   ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Since 2.6.28, zone->prev_priority has been unused and can be removed
safely. Doing so also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more)
attempts have not produced good performance numbers, so I have given up
on that approach.

The rest of this changelog consists of notes on prev_priority: why it
existed in the first place and why it may no longer be necessary. This
information is based heavily on discussions between Andrew Morton, Rik
van Riel and KOSAKI Motohiro, who is quoted heavily below.

Historically, prev_priority was important because it determined when the
VM would start unmapping PTE pages, i.e. it fed into the balances of note
within the VM: Anon vs File and Mapped vs Unmapped. Without prev_priority,
there is a potential risk of unnecessarily increasing minor faults, as a
large amount of read activity on use-once pages could push mapped pages
to the end of the LRU where they get unmapped.

There is no proof that this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it were still a problem, prev_priority would not work nowadays.
First of all, current vmscan still contains a lot of UP-centric code,
which exposes weaknesses on machines with dozens of CPUs. I think we need
more and more improvement there.

The problem is that current vmscan mixes up per-system pressure,
per-zone pressure and per-task pressure a bit. For example, prev_priority
tries to boost the priority of other concurrent reclaimers, but if
another task has a mempolicy restriction this is unnecessary, and it also
causes wrongly large latencies and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal and 2) we need
per-zone pressure, not per-system pressure.

Another example: DEF_PRIORITY is currently 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the
OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim
at the same time, the system is under higher memory pressure than
priority==0 implies (1/4096 * 10,000 > 2). prev_priority cannot solve
such a multithreaded workload issue. In other words, the prev_priority
concept assumes the system does not have lots of threads.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/memcontrol.h |    5 ----
 include/linux/mmzone.h     |   15 -----------
 mm/memcontrol.c            |   31 ------------------------
 mm/page_alloc.c            |    2 -
 mm/vmscan.c                |   57 --------------------------------------------
 mm/vmstat.c                |    2 -
 6 files changed, 0 insertions(+), 112 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9411d32..9f1afd3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6ece0a..7557f66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -211,8 +211,6 @@ struct mem_cgroup {
 	*/
 	spinlock_t reclaim_param_lock;
 
-	int	prev_priority;	/* for recording reclaim priority */
-
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
 	 * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-	int prev_priority;
-
-	spin_lock(&mem->reclaim_param_lock);
-	prev_priority = mem->prev_priority;
-	spin_unlock(&mem->reclaim_param_lock);
-
-	return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	if (priority < mem->prev_priority)
-		mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-	spin_lock(&mem->reclaim_param_lock);
-	mem->prev_priority = priority;
-	spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
 	unsigned long active;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20160c7..f3d95c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 07/14] vmscan: simplify shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Now that max_scan of shrink_inactive_list() is always passed a value no
larger than SWAP_CLUSTER_MAX, the loop that scans pages in batches can be
removed. This also helps reduce stack usage; a simplified sketch of the
resulting flow follows the list below.

In detail:
 - remove the "while (nr_scanned < max_scan)" loop
 - remove nr_freed (nr_reclaimed is used directly now)
 - remove nr_scan (nr_scanned is used directly now)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead of
   using SWAP_CLUSTER_MAX
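
For illustration only, here is a minimal userspace sketch of the simplified
flow; it is not kernel code, and isolate_batch()/reclaim_batch() are
hypothetical stand-ins for isolate_pages() and shrink_page_list():

    #include <stddef.h>

    #define BATCH_MAX 32                     /* plays the role of SWAP_CLUSTER_MAX */

    static size_t isolate_batch(size_t want, size_t *scanned)
    {
            size_t taken = want < BATCH_MAX ? want : BATCH_MAX;

            *scanned = taken;                /* pretend every candidate was isolated */
            return taken;
    }

    static size_t reclaim_batch(size_t taken)
    {
            return taken;                    /* pretend every isolated item was reclaimed */
    }

    /* One batch per call: no "while (nr_scanned < max_scan)" loop any more. */
    static size_t shrink_batch(size_t nr_to_scan)
    {
            size_t nr_scanned;
            size_t nr_taken = isolate_batch(nr_to_scan, &nr_scanned);

            if (nr_taken == 0)
                    return 0;

            return reclaim_batch(nr_taken);  /* nr_reclaimed used directly */
    }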

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |  211 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3d95c6..d964cfa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
+		/*
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
+		 */
+	}
 
-		if (nr_taken == 0)
-			goto done;
+	if (nr_taken == 0)
+		goto done;
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	nr_active = clear_active_flags(&page_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+					-count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+					-count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+					-count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+					-count[LRU_INACTIVE_ANON]);
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+	reclaim_stat->recent_scanned[0] += nr_anon;
+	reclaim_stat->recent_scanned[1] += nr_file;
 
-		spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() && sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
-
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+		nr_active = clear_active_flags(&page_list, count);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += nr_freed;
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-		spin_lock(&zone->lru_lock);
-		/*
-		 * Put back any unfreeable pages.
-		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
+	spin_lock(&zone->lru_lock);
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	while (!list_empty(&page_list)) {
+		int lru;
+		page = lru_to_page(&page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
 		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-  	} while (nr_scanned < max_scan);
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
 done:
 	spin_unlock_irq(&zone->lru_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 08/14] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

Remove a temporary variable that is used only once and does not help
clarify the code.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d964cfa..509d093 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 09/14] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

shrink_inactive_list() sets up a pagevec to release unfreeable pages, which
consumes a significant amount of stack. This patch splits
shrink_inactive_list() so that the stack usage is moved out of the main path
and the code that calls writepage() no longer carries an unused pagevec on
its stack.
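
The underlying trick is generic: if a large object is only needed for the
cleanup phase, moving that phase into a function the compiler cannot inline
keeps the object out of the frame of the deep call path. A standalone sketch
of the pattern, not kernel code (the kernel spells the attribute
noinline_for_stack):

    #include <stddef.h>

    #define BATCH 14                        /* roughly a pagevec's worth of slots */

    /* The large batch buffer lives only in this helper's frame. */
    static __attribute__((noinline)) void putback_leftovers(void **items, size_t nr)
    {
            void *buf[BATCH];
            size_t pending = 0;
            size_t i;

            for (i = 0; i < nr; i++) {
                    buf[pending++] = items[i];
                    if (pending == BATCH)
                            pending = 0;    /* a real version would release the batch here */
            }
            /* ... and release any remainder here */
    }

    /* Deep path: its frame no longer contains the batch buffer at all. */
    static size_t reclaim_items(void **items, size_t nr)
    {
            size_t reclaimed = nr / 2;      /* pretend half were reclaimed */

            putback_leftovers(items + reclaimed, nr - reclaimed);
            return reclaimed;
    }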

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   99 +++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 56 insertions(+), 43 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 509d093..8b4ed48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void
+putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
+				unsigned long nr_anon, unsigned long nr_file,
+				struct list_head *page_list)
+{
+	struct page *page;
+	struct pagevec pvec;
+
+	pagevec_init(&pvec, 1);
+
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
-			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+static noinline_for_stack unsigned long
+shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
+			struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
-	struct pagevec pvec;
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	struct page *page;
 	unsigned long nr_taken;
 	unsigned long nr_active;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
@@ -1157,8 +1203,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	}
 
 
-	pagevec_init(&pvec, 1);
-
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1186,8 +1230,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		 */
 	}
 
-	if (nr_taken == 0)
-		goto done;
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
 	nr_active = clear_active_flags(&page_list, count);
 	__count_vm_events(PGDEACTIVATE, nr_active);
@@ -1237,40 +1283,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	spin_lock(&zone->lru_lock);
-	/*
-	 * Put back any unfreeable pages.
-	 */
-	while (!list_empty(&page_list)) {
-		int lru;
-		page = lru_to_page(&page_list);
-		VM_BUG_ON(PageLRU(page));
-		list_del(&page->lru);
-		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
-			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
-			continue;
-		}
-		SetPageLRU(page);
-		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
-		if (is_active_lru(lru)) {
-			int file = is_file_lru(lru);
-			reclaim_stat->recent_rotated[file]++;
-		}
-		if (!pagevec_add(&pvec, page)) {
-			spin_unlock_irq(&zone->lru_lock);
-			__pagevec_release(&pvec);
-			spin_lock_irq(&zone->lru_lock);
-		}
-	}
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
+	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 10/14] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

shrink_page_list() sets up a pagevec to release pages as they are freed,
and that pagevec consumes a significant amount of stack. This patch instead
collects the pages to be freed on a linked list which is then freed en masse
at the end. This avoids using stack in the main path that potentially calls
writepage().
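
The same idea in a standalone C sketch, not the kernel API: deferred frees
are threaded onto a list head that costs one pointer of stack, and the
batched release happens once, after the loop that may go deep:

    #include <stdlib.h>

    struct node {
            struct node *next;
            int keep;
    };

    /* Batched release, kept out of the hot path. */
    static void free_deferred(struct node *head)
    {
            while (head) {
                    struct node *next = head->next;

                    free(head);
                    head = next;
            }
    }

    /* The main loop carries only a single pointer for the pending frees. */
    static void process(struct node **items, int n)
    {
            struct node *free_head = NULL;
            int i;

            for (i = 0; i < n; i++) {
                    if (items[i]->keep)
                            continue;               /* this one would be put back */
                    items[i]->next = free_head;     /* defer the free */
                    free_head = items[i];
                    items[i] = NULL;
            }

            free_deferred(free_head);               /* release en masse */
    }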

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   36 ++++++++++++++++++++++++++++--------
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b4ed48..1107830 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,24 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -628,13 +646,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -809,10 +826,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -835,9 +854,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 11/14] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

When shrink_inactive_list() isolates pages, it gathers a number of counters
in temporary variables. These consume stack in the main path that calls
->writepage(). This patch moves the accounting updates out of the main path
to reduce stack usage.
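
As a generic sketch with hypothetical names: the per-class scratch counters
live only in the helper's frame (the kernel additionally marks such helpers
noinline_for_stack), and the caller receives just the two totals it still
needs through out-parameters:

    enum item_class { CLASS_A, CLASS_B, CLASS_C, CLASS_D, NR_CLASSES };

    static void count_classes(const enum item_class *classes, int n,
                              unsigned long *nr_ab, unsigned long *nr_cd)
    {
            unsigned int count[NR_CLASSES] = { 0, };   /* scratch space stays here */
            int i;

            for (i = 0; i < n; i++)
                    count[classes[i]]++;

            *nr_ab = count[CLASS_A] + count[CLASS_B];
            *nr_cd = count[CLASS_C] + count[CLASS_D];
    }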

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   63 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1107830..efa6ee4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1072,7 +1072,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1152,12 +1153,13 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
-putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
+putback_lru_pages(struct zone *zone, struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
 				struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1196,6 +1198,37 @@ putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
 	pagevec_release(&pvec);
 }
 
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
+
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
+
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
+
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1207,10 +1240,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_taken;
 	unsigned long nr_active;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1255,25 +1286,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		return 0;
 	}
 
-	nr_active = clear_active_flags(&page_list, count);
-	__count_vm_events(PGDEACTIVATE, nr_active);
-
-	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-					-count[LRU_ACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-					-count[LRU_INACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-					-count[LRU_ACTIVE_ANON]);
-	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-					-count[LRU_INACTIVE_ANON]);
-
-	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
-
-	reclaim_stat->recent_scanned[0] += nr_anon;
-	reclaim_stat->recent_scanned[1] += nr_file;
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
 	spin_unlock_irq(&zone->lru_lock);
 
@@ -1292,7 +1305,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
 		 */
-		nr_active = clear_active_flags(&page_list, count);
+		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
 		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
@@ -1303,7 +1316,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
in direct lumpy reclaim and dirty pages are encountered, the process will
stall for the background flusher before trying to reclaim the pages again.

Memory control groups do not have a kswapd-like thread, nor do their pages
get directly reclaimed from the page allocator. Instead, memory control group
pages are reclaimed when the quota is being exceeded or the group is being
shrunk. As the entry points into page reclaim are not expected to be deep
call chains, memcg is still allowed to write back dirty pages.
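
In condensed form, the logic added below behaves roughly as follows (a
simplified sketch of the hunks in this patch, with counters and error
handling omitted):

	/* in shrink_page_list(): kswapd and memcg may still write back,
	 * direct reclaim defers the page instead */
	if (PageDirty(page)) {
		if (!reclaim_can_writeback(sc) &&
				dirty_isolated < MAX_SWAP_CLEAN_WAIT) {
			list_add(&page->lru, &dirty_pages);
			unlock_page(page);
			nr_dirty++;
			goto keep_dirty;
		}
		/* otherwise fall through to pageout() as before */
	}

	/* after the scan loop */
	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
		/* ask flusher threads to clean them, then throttle */
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* lumpy reclaim needs these specific pages, so retry */
		if (sync_writeback == PAGEOUT_IO_SYNC) {
			dirty_isolated++;
			list_splice(&dirty_pages, page_list);
			INIT_LIST_HEAD(&dirty_pages);
			goto restart_dirty;
		}
	}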

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |  158 ++++++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 108 insertions(+), 50 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efa6ee4..d5a2e74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,56 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a range of pages fails.
+	 */
+	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		wait_on_page_writeback(page);
+
+	if (!PageWriteback(page)) {
+		/* synchronous write or broken a_ops? */
+		ClearPageReclaim(page);
+	}
+	trace_mm_vmscan_writepage(page,
+		sync_writeback == PAGEOUT_IO_SYNC);
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+	return PAGE_SUCCESS;
+}
+
+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+	return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -367,45 +417,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
-	if (clear_page_dirty_for_io(page)) {
-		int res;
-		struct writeback_control wbc = {
-			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
-			.range_start = 0,
-			.range_end = LLONG_MAX,
-			.nonblocking = 1,
-			.for_reclaim = 1,
-		};
-
-		SetPageReclaim(page);
-		res = mapping->a_ops->writepage(page, &wbc);
-		if (res < 0)
-			handle_write_error(mapping, page, res);
-		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			ClearPageReclaim(page);
-			return PAGE_ACTIVATE;
-		}
-
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
-		if (!PageWriteback(page)) {
-			/* synchronous write or broken a_ops? */
-			ClearPageReclaim(page);
-		}
-		trace_mm_vmscan_writepage(page,
-			sync_writeback == PAGEOUT_IO_SYNC);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
-		return PAGE_SUCCESS;
-	}
-
-	return PAGE_CLEAN;
+	return write_reclaim_page(page, mapping, sync_writeback);
 }
 
 /*
@@ -638,6 +650,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 	pagevec_free(&freed_pvec);
 }
 
+/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -645,13 +660,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
-	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
-	int pgactivate = 0;
+	LIST_HEAD(putback_pages);
+	LIST_HEAD(dirty_pages);
+	int pgactivate;
+	int dirty_isolated = 0;
+	unsigned long nr_dirty;
 	unsigned long nr_reclaimed = 0;
 
+	pgactivate = 0;
 	cond_resched();
 
+restart_dirty:
+	nr_dirty = 0;
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -740,7 +761,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (PageDirty(page)) {
+		if (PageDirty(page))  {
+			/*
+			 * If the caller cannot writeback pages, dirty pages are
+			 * put on a separate list for cleaning by either a flusher
+			 * thread or kswapd
+			 */
+			if (!reclaim_can_writeback(sc) &&
+					dirty_isolated < MAX_SWAP_CLEAN_WAIT) {
+				list_add(&page->lru, &dirty_pages);
+				unlock_page(page);
+				nr_dirty++;
+				goto keep_dirty;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -851,13 +885,38 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		list_add(&page->lru, &ret_pages);
+		list_add(&page->lru, &putback_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+		/*
+		 * Wakeup a flusher thread to clean at least as many dirty
+		 * pages as encountered by direct reclaim. Wait on congestion
+		 * to throttle processes cleaning dirty pages
+		 */
+		wakeup_flusher_threads(nr_dirty);
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		/*
+		 * As lumpy reclaim targets specific pages, wait on them
+		 * to be cleaned and try reclaim again for a time.
+		 */
+		if (sync_writeback == PAGEOUT_IO_SYNC) {
+			dirty_isolated++;
+			list_splice(&dirty_pages, page_list);
+			INIT_LIST_HEAD(&dirty_pages);
+			goto restart_dirty;
+		}
+	}
+
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	if (!list_empty(&dirty_pages))
+		list_splice(&dirty_pages, page_list);
+	list_splice(&putback_pages, page_list);
+
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1866,10 +1925,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
+		if (total_scanned > writeback_threshold)
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-			sc->may_writepage = 1;
-		}
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1907,7 +1964,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -1936,7 +1993,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -2588,7 +2645,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+					(zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow dirty pages in btrfs to be written
back from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/btrfs/disk-io.c |   21 +--------------------
 1 files changed, 1 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..e4aa547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -696,26 +696,7 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
 	int was_dirty;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	if (!(current->flags & PF_MEMALLOC)) {
-		return extent_write_full_page(tree, page,
-					      btree_get_extent, wbc);
-	}
-
-	redirty_page_for_writepage(wbc, page);
-	eb = btrfs_find_tree_block(root, page_offset(page),
-				      PAGE_CACHE_SIZE);
-	WARN_ON(!eb);
-
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
-	if (!was_dirty) {
-		spin_lock(&root->fs_info->delalloc_lock);
-		root->fs_info->dirty_metadata_bytes += PAGE_CACHE_SIZE;
-		spin_unlock(&root->fs_info->delalloc_lock);
-	}
-	free_extent_buffer(eb);
-
-	unlock_page(page);
-	return 0;
+	return extent_write_full_page(tree, page, btree_get_extent, wbc);
 }
 
 static int btree_writepages(struct address_space *mapping,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 11:34 ` Mel Gorman
@ 2010-06-29 11:34   ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 11:34 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman

As only kswapd and memcg are writing back pages, there should be no
danger of overflowing the stack. Allow dirty pages in xfs to be written
back from the VM.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/xfs/linux-2.6/xfs_aops.c |   15 ---------------
 1 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 34640d6..4c89db3 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1333,21 +1333,6 @@ xfs_vm_writepage(
 	trace_xfs_writepage(inode, page, 0);
 
 	/*
-	 * Refuse to write the page out if we are called from reclaim context.
-	 *
-	 * This is primarily to avoid stack overflows when called from deep
-	 * used stacks in random callers for direct reclaim, but disabling
-	 * reclaim for kswap is a nice side-effect as kswapd causes rather
-	 * suboptimal I/O patters, too.
-	 *
-	 * This should really be done by the core VM, but until that happens
-	 * filesystems like XFS, btrfs and ext4 have to take care of this
-	 * by themselves.
-	 */
-	if (current->flags & PF_MEMALLOC)
-		goto out_fail;
-
-	/*
 	 * We need a transaction if:
 	 *  1. There are delalloc buffers on the page
 	 *  2. The page is uptodate and we have unmapped buffers
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 12:37     ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-06-29 12:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

I don't see a patch in this set which refuses writeback from the memcg
context, which we identified as having a large stack footprint in the
discussion of the last patch set.

Meanwhile I've submitted a patch to xfs to allow reclaim from kswapd,
and just prevent it from direct and memcg reclaim.

Btw, it might be worth also allowing kswapd to do writeout on ext4,
but doing that will be a bit more complicated than the btrfs and xfs
variants as the code is rather convoluted.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 12:37     ` Christoph Hellwig
@ 2010-06-29 12:51       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-06-29 12:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> I don't see a patch in this set which refuses writeback from the memcg
> context, which we identified as having large stack footprint in hte
> discussion of the last patch set.
> 

It wasn't clear to me what the right approach was there and I should
have noted that in the intro. The last note I have on it is this message
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
pages for memcg, I left memcg being able to direct writeback to see what
the memcg people preferred.

> Meanwhile I've submitted a patch to xfs to allow reclaim from kswapd,
> and just prevent it from direct and memcg reclaim.
> 

Good stuff.

> Btw, it might be worth to also allow kswap to all writeout on ext4,
> but doing that will be a bit more complicated than the btrfs and xfs
> variants as the code is rather convoluted.
> 

Fully agreed. I looked into it and got caught in its twisty web so
postponed it until this much can be finalised, agreed upon or rejected -
all prerequisites to making the ext4 work worthwhile.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 14:27     ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-06-29 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 8:34 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Use lock_page_nosync in handle_write_error as after writepage we have no
> reference to the mapping when taking the page lock.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Trivial.
Please modify the description of the function if there is a next revision:
"run sleeping lock_page()" -> "run sleeping lock_page_nosync()"


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-29 14:44     ` Johannes Weiner
  -1 siblings, 0 replies; 105+ messages in thread
From: Johannes Weiner @ 2010-06-29 14:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli

On Tue, Jun 29, 2010 at 12:34:35PM +0100, Mel Gorman wrote:
> From: Nick Piggin <npiggin@suse.de>
> 
> Use lock_page_nosync in handle_write_error as after writepage we have no
> reference to the mapping when taking the page lock.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-29 12:51       ` Mel Gorman
@ 2010-06-30  0:14         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-30  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, 29 Jun 2010 13:51:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > I don't see a patch in this set which refuses writeback from the memcg
> > context, which we identified as having large stack footprint in hte
> > discussion of the last patch set.
> > 
> 
> It wasn't clear to me what the right approach was there and should
> have noted that in the intro. The last note I have on it is this message
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> pages for memcg, I left memcg being able to direct writeback to see what
> the memcg people preferred.
> 

Hmm. If some filesystems don't support direct ->writeback, memcg shouldn't
depend on it. If so, memcg should depend on some writeback thread (as kswapd
does). OK.

Then, my concern here is which kswapd we should wake up and how it can stop.
IOW, how kswapd can know that a memcg has some remaining writeback and is
stuck on it.

Here is one idea (this patch will not work... not tested at all).
If we can have a "victim page list" and kswapd can consult it to know which
pages should be written, kswapd can know when it should work.

CPU usage by memcg will be a new problem... but...

==
Add a new LRU, "CLEANING", and make kswapd launder it.
This patch also changes PG_reclaim behavior. The new PG_reclaim works
as follows:
 - If PG_reclaim is set, the page is on the CLEANING list.

And when kswapd launders a page:
 - issue a writeback. (I'm thinking about whether I should put this
   cleaned page back on the CLEANING lru and free it later.)
 - if it can be freed directly, free it.
This just uses the current shrink_list().

Maybe this patch itself includes many bad points...
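
Roughly, the intended flow is the following (a sketch only, condensed from
the untested patch below; the helper names are the ones the patch
introduces):

	/* pageout(): direct reclaim no longer writes, it tags the page
	 * so that it lands on the new LRU_CLEANING list */
	if (!current_is_kswapd()) {
		SetPageReclaim(page);
		return PAGE_KEEP;
	}

	/* kswapd loop: launder any node that has CLEANING pages */
	if (need_to_cleaning_node(pgdat))
		launder_pgdat(pgdat);	/* shrink_cleaning_list() per zone */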

---
 fs/proc/meminfo.c         |    2 
 include/linux/mm_inline.h |    9 ++
 include/linux/mmzone.h    |    7 ++
 mm/filemap.c              |    3 
 mm/internal.h             |    1 
 mm/page-writeback.c       |    1 
 mm/page_io.c              |    1 
 mm/swap.c                 |   31 ++-------
 mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
 9 files changed, 176 insertions(+), 32 deletions(-)

Index: mmotm-0611/include/linux/mmzone.h
===================================================================
--- mmotm-0611.orig/include/linux/mmzone.h
+++ mmotm-0611/include/linux/mmzone.h
@@ -85,6 +85,7 @@ enum zone_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_CLEANING,		/*  "     "     "   "       "         */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -133,6 +134,7 @@ enum lru_list {
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
+	LRU_CLEANING,
 	NR_LRU_LISTS
 };
 
@@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
 	return (l == LRU_UNEVICTABLE);
 }
 
+static inline int is_cleaning_lru(enum lru_list l)
+{
+	return (l == LRU_CLEANING);
+}
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
Index: mmotm-0611/include/linux/mm_inline.h
===================================================================
--- mmotm-0611.orig/include/linux/mm_inline.h
+++ mmotm-0611/include/linux/mm_inline.h
@@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
 	enum lru_list l;
 
 	list_del(&page->lru);
-	if (PageUnevictable(page)) {
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		l = LRU_CLEANING;
+	} else if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
 	} else {
@@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
 {
 	enum lru_list lru;
 
-	if (PageUnevictable(page))
+	if (PageReclaim(page)) {
+		lru = LRU_CLEANING;
+	} else if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
 	else {
 		lru = page_lru_base_type(page);
Index: mmotm-0611/fs/proc/meminfo.c
===================================================================
--- mmotm-0611.orig/fs/proc/meminfo.c
+++ mmotm-0611/fs/proc/meminfo.c
@@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
 		"Unevictable:    %8lu kB\n"
+		"Cleaning:       %8lu kB\n"
 		"Mlocked:        %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
 		K(pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_UNEVICTABLE]),
+		K(pages[LRU_CLEANING]),
 		K(global_page_state(NR_MLOCK)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: mmotm-0611/mm/swap.c
===================================================================
--- mmotm-0611.orig/mm/swap.c
+++ mmotm-0611/mm/swap.c
@@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+		if (PageLRU(page)) {
+			int lru = page_lru(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
 			pgmoved++;
 		}
@@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
 	pagevec_reinit(pvec);
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void  rotate_reclaimable_page(struct page *page)
-{
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
-	    !PageUnevictable(page) && PageLRU(page)) {
-		struct pagevec *pvec;
-		unsigned long flags;
-
-		page_cache_get(page);
-		local_irq_save(flags);
-		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
-			pagevec_move_tail(pvec);
-		local_irq_restore(flags);
-	}
-}
-
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
@@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
 {
 	if (PageActive(page)) {
 		VM_BUG_ON(PageUnevictable(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageActive(page);
 	} else if (PageUnevictable(page)) {
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageUnevictable(page);
+	} else if (PageReclaim(page)) {
+		VM_BUG_ON(PageReclaim(page));
+		VM_BUG_ON(PageUnevictable(page));
+		ClearPageReclaim(page);
 	}
 
 	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
Index: mmotm-0611/mm/filemap.c
===================================================================
--- mmotm-0611.orig/mm/filemap.c
+++ mmotm-0611/mm/filemap.c
@@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (TestClearPageReclaim(page))
-		rotate_reclaimable_page(page);
-
 	if (!test_clear_page_writeback(page))
 		BUG();
 
Index: mmotm-0611/mm/internal.h
===================================================================
--- mmotm-0611.orig/mm/internal.h
+++ mmotm-0611/mm/internal.h
@@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
Index: mmotm-0611/mm/page-writeback.c
===================================================================
--- mmotm-0611.orig/mm/page-writeback.c
+++ mmotm-0611/mm/page-writeback.c
@@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
Index: mmotm-0611/mm/page_io.c
===================================================================
--- mmotm-0611.orig/mm/page_io.c
+++ mmotm-0611/mm/page_io.c
@@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
 				imajor(bio->bi_bdev->bd_inode),
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
-		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
 	bio_put(bio);
Index: mmotm-0611/mm/vmscan.c
===================================================================
--- mmotm-0611.orig/mm/vmscan.c
+++ mmotm-0611/mm/vmscan.c
@@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
+	if (!current_is_kswapd()) {
+		/* pass this page to kswapd. */
+		SetPageReclaim(page);
+		return PAGE_KEEP;
+	}
+
 	if (clear_page_dirty_for_io(page)) {
 		int res;
 		struct writeback_control wbc = {
@@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
 
 redo:
 	ClearPageUnevictable(page);
+	/* This function never puts pages to CLEANING queue */
+	ClearPageReclaim(page);
 
 	if (page_evictable(page, NULL)) {
 		/*
@@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
 		 * page release code relies on it.
 		 */
 		ClearPageLRU(page);
+		/* when someone isolate this page, clear reclaim status */
+		ClearPageReclaim(page);
 		ret = 0;
 	}
 
@@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
  * # of pages of each types and clearing any active bits.
  */
 static unsigned long count_page_types(struct list_head *page_list,
-				unsigned int *count, int clear_active)
+				unsigned int *count, int clear_actives)
 {
 	int nr_active = 0;
 	int lru;
@@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
 			int lru = page_lru(page);
 			ret = 0;
 			ClearPageLRU(page);
+			ClearPageReclaim(page);
 
 			del_page_from_lru_list(zone, page, lru);
 		}
@@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
 	return isolated > inactive;
 }
 
+/* only called by kswapd to do I/O and put back clean pages to its LRU */
+static void shrink_cleaning_list(struct zone *zone)
+{
+	LIST_HEAD(page_list);
+	struct list_head *src;
+	struct pagevec pvec;
+	unsigned long nr_pageout;
+	unsigned long nr_cleaned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = 1,
+		.mem_cgroup = NULL,
+	};
+
+	pagevec_init(&pvec, 1);
+	lru_add_drain();
+
+	src = &zone->lru[LRU_CLEANING].list;
+	nr_pageout = 0;
+	nr_cleaned = 0;
+	spin_lock_irq(&zone->lru_lock);
+	do {
+		unsigned int count[NR_LRU_LISTS] = {0,};
+		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
+		unsigned long nr_scan;
+
+		if (list_empty(src))
+			goto done;
+
+		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
+				zone_page_state(zone, NR_CLEANING)/8);
+		/* we do global-only */
+		nr_taken = isolate_lru_pages(check_clean,
+					src, &page_list, &nr_scan,
+					0, ISOLATE_BOTH, 0);
+		zone->pages_scanned += nr_scan;
+		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
+		if (nr_taken == 0)
+			goto done;
+		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
+		spin_unlock_irq(&zone->lru_lock);
+		/*
+		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
+		 * we can count correct value
+		 */
+		count_page_types(&page_list, count, 0);
+		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+
+		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
+		/*
+		 * Put back any unfreeable pages.
+		 */
+		while (!list_empty(&page_list)) {
+			int lru;
+			struct page *page;
+
+			page = lru_to_page(&page_list);
+			VM_BUG_ON(PageLRU(page));
+			list_del(&page->lru);
+			if (!unlikely(!page_evictable(page, NULL))) {
+				spin_unlock_irq(&zone->lru_lock);
+				putback_lru_page(page);
+				spin_lock_irq(&zone->lru_lock);
+				continue;
+			}
+			SetPageLRU(page);
+			lru = page_lru(page);
+			add_page_to_lru_list(zone, page, lru);
+			if (!pagevec_add(&pvec, page)) {
+				spin_unlock_irq(&zone->lru_lock);
+				__pagevec_release(&pvec);
+				spin_lock_irq(&zone->lru_lock);
+			}
+		}
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_pageout += nr_taken - nr_freed;
+		nr_cleaned += nr_freed;
+		if (nr_pageout > SWAP_CLUSTER_MAX) {
+			/* there are remaining I/Os */
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			nr_pageout /= 2;
+		}
+	} while(nr_cleaned < SWAP_CLUSTER_MAX);
+done:
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+	return;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
 					sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (current_is_kswapd())
+			shrink_cleaning_list(zone);
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2222,6 +2333,42 @@ out:
 	return sc.nr_reclaimed;
 }
 
+static void launder_pgdat(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+		shrink_cleaning_list(zone);
+	}
+}
+
+/*
+ * Find a zone which has cleaning list.
+ */
+static int need_to_cleaning_node(pg_data_t *pgdat)
+{
+	int i;
+	struct zone *zone;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+	}
+	return (i != MAX_NR_ZONES);
+}
+
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2275,7 +2422,9 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		if (need_to_cleaning_node(pgdat)) {
+			launder_pgdat(pgdat);
+		} else if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
 			 * allocation


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
@ 2010-06-30  0:14         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-30  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, 29 Jun 2010 13:51:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > I don't see a patch in this set which refuses writeback from the memcg
> > context, which we identified as having large stack footprint in hte
> > discussion of the last patch set.
> > 
> 
> It wasn't clear to me what the right approach was there and should
> have noted that in the intro. The last note I have on it is this message
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> pages for memcg, I left memcg being able to direct writeback to see what
> the memcg people preferred.
> 

Hmm. If some filesystems don't support direct ->writeback, memcg shouldn't
depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
ok.

Then, my concern here is that which kswapd we should wake up and how it can stop.
IOW, how kswapd can know a memcg has some remaining writeback and struck on it.

One idea is here. (this patch will not work...not tested at all.)
If we can have "victim page list" and kswapd can depend on it to know
"which pages should be written", kswapd can know when it should work.

cpu usage by memcg will be a new problem...but...

==
Add a new LRU "CLEANING" and make kswapd launder it.
This patch also changes PG_reclaim behavior. New PG_reclaim works
as
 - If PG_reclaim is set, a page is on CLEAINING LIST.

And when kswapd launder a page
 - issue an writeback. (I'm thinking whehter I should put this
   cleaned page back to CLEANING lru and free it later.) 
 - if it can free directly, free it.
This just use current shrink_list().

Maybe this patch itself inlcludes many bad point...

---
 fs/proc/meminfo.c         |    2 
 include/linux/mm_inline.h |    9 ++
 include/linux/mmzone.h    |    7 ++
 mm/filemap.c              |    3 
 mm/internal.h             |    1 
 mm/page-writeback.c       |    1 
 mm/page_io.c              |    1 
 mm/swap.c                 |   31 ++-------
 mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
 9 files changed, 176 insertions(+), 32 deletions(-)

Index: mmotm-0611/include/linux/mmzone.h
===================================================================
--- mmotm-0611.orig/include/linux/mmzone.h
+++ mmotm-0611/include/linux/mmzone.h
@@ -85,6 +85,7 @@ enum zone_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_CLEANING,		/*  "     "     "   "       "         */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -133,6 +134,7 @@ enum lru_list {
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
+	LRU_CLEANING,
 	NR_LRU_LISTS
 };
 
@@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
 	return (l == LRU_UNEVICTABLE);
 }
 
+static inline int is_cleaning_lru(enum lru_list l)
+{
+	return (l == LRU_CLEANING);
+}
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
Index: mmotm-0611/include/linux/mm_inline.h
===================================================================
--- mmotm-0611.orig/include/linux/mm_inline.h
+++ mmotm-0611/include/linux/mm_inline.h
@@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
 	enum lru_list l;
 
 	list_del(&page->lru);
-	if (PageUnevictable(page)) {
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		l = LRU_CLEANING;
+	} else if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
 	} else {
@@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
 {
 	enum lru_list lru;
 
-	if (PageUnevictable(page))
+	if (PageReclaim(page)) {
+		lru = LRU_CLEANING;
+	} else if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
 	else {
 		lru = page_lru_base_type(page);
Index: mmotm-0611/fs/proc/meminfo.c
===================================================================
--- mmotm-0611.orig/fs/proc/meminfo.c
+++ mmotm-0611/fs/proc/meminfo.c
@@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
 		"Unevictable:    %8lu kB\n"
+		"Cleaning:       %8lu kB\n"
 		"Mlocked:        %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
 		K(pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_UNEVICTABLE]),
+		K(pages[LRU_CLEANING]),
 		K(global_page_state(NR_MLOCK)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: mmotm-0611/mm/swap.c
===================================================================
--- mmotm-0611.orig/mm/swap.c
+++ mmotm-0611/mm/swap.c
@@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+		if (PageLRU(page)) {
+			int lru = page_lru(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
 			pgmoved++;
 		}
@@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
 	pagevec_reinit(pvec);
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void  rotate_reclaimable_page(struct page *page)
-{
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
-	    !PageUnevictable(page) && PageLRU(page)) {
-		struct pagevec *pvec;
-		unsigned long flags;
-
-		page_cache_get(page);
-		local_irq_save(flags);
-		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
-			pagevec_move_tail(pvec);
-		local_irq_restore(flags);
-	}
-}
-
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
@@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
 {
 	if (PageActive(page)) {
 		VM_BUG_ON(PageUnevictable(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageActive(page);
 	} else if (PageUnevictable(page)) {
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageReclaim(page));
 		ClearPageUnevictable(page);
+	} else if (PageReclaim(page)) {
+		VM_BUG_ON(PageReclaim(page));
+		VM_BUG_ON(PageUnevictable(page));
+		ClearPageReclaim(page);
 	}
 
 	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
Index: mmotm-0611/mm/filemap.c
===================================================================
--- mmotm-0611.orig/mm/filemap.c
+++ mmotm-0611/mm/filemap.c
@@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (TestClearPageReclaim(page))
-		rotate_reclaimable_page(page);
-
 	if (!test_clear_page_writeback(page))
 		BUG();
 
Index: mmotm-0611/mm/internal.h
===================================================================
--- mmotm-0611.orig/mm/internal.h
+++ mmotm-0611/mm/internal.h
@@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
Index: mmotm-0611/mm/page-writeback.c
===================================================================
--- mmotm-0611.orig/mm/page-writeback.c
+++ mmotm-0611/mm/page-writeback.c
@@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
Index: mmotm-0611/mm/page_io.c
===================================================================
--- mmotm-0611.orig/mm/page_io.c
+++ mmotm-0611/mm/page_io.c
@@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
 				imajor(bio->bi_bdev->bd_inode),
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
-		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
 	bio_put(bio);
Index: mmotm-0611/mm/vmscan.c
===================================================================
--- mmotm-0611.orig/mm/vmscan.c
+++ mmotm-0611/mm/vmscan.c
@@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
+	if (!current_is_kswapd()) {
+		/* pass this page to kswapd. */
+		SetPageReclaim(page);
+		return PAGE_KEEP;
+	}
+
 	if (clear_page_dirty_for_io(page)) {
 		int res;
 		struct writeback_control wbc = {
@@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
 
 redo:
 	ClearPageUnevictable(page);
+	/* This function never puts pages to CLEANING queue */
+	ClearPageReclaim(page);
 
 	if (page_evictable(page, NULL)) {
 		/*
@@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
 		 * page release code relies on it.
 		 */
 		ClearPageLRU(page);
+		/* when someone isolate this page, clear reclaim status */
+		ClearPageReclaim(page);
 		ret = 0;
 	}
 
@@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
  * # of pages of each types and clearing any active bits.
  */
 static unsigned long count_page_types(struct list_head *page_list,
-				unsigned int *count, int clear_active)
+				unsigned int *count, int clear_actives)
 {
 	int nr_active = 0;
 	int lru;
@@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
 			int lru = page_lru(page);
 			ret = 0;
 			ClearPageLRU(page);
+			ClearPageReclaim(page);
 
 			del_page_from_lru_list(zone, page, lru);
 		}
@@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
 	return isolated > inactive;
 }
 
+/* only called by kswapd to do I/O and put back clean pages to its LRU */
+static void shrink_cleaning_list(struct zone *zone)
+{
+	LIST_HEAD(page_list);
+	struct list_head *src;
+	struct pagevec pvec;
+	unsigned long nr_pageout;
+	unsigned long nr_cleaned;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.nr_to_reclaim = ULONG_MAX,
+		.swappiness = vm_swappiness,
+		.order = 1,
+		.mem_cgroup = NULL,
+	};
+
+	pagevec_init(&pvec, 1);
+	lru_add_drain();
+
+	src = &zone->lru[LRU_CLEANING].list;
+	nr_pageout = 0;
+	nr_cleaned = 0;
+	spin_lock_irq(&zone->lru_lock);
+	do {
+		unsigned int count[NR_LRU_LISTS] = {0,};
+		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
+		unsigned long nr_scan;
+
+		if (list_empty(src))
+			goto done;
+
+		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
+				zone_page_state(zone, NR_CLEANING)/8);
+		/* we do global-only */
+		nr_taken = isolate_lru_pages(check_clean,
+					src, &page_list, &nr_scan,
+					0, ISOLATE_BOTH, 0);
+		zone->pages_scanned += nr_scan;
+		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
+		if (nr_taken == 0)
+			goto done;
+		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
+		spin_unlock_irq(&zone->lru_lock);
+		/*
+		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
+		 * we can count correct value
+		 */
+		count_page_types(&page_list, count, 0);
+		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+
+		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
+		/*
+		 * Put back any unfreeable pages.
+		 */
+		while (!list_empty(&page_list)) {
+			int lru;
+			struct page *page;
+
+			page = lru_to_page(&page_list);
+			VM_BUG_ON(PageLRU(page));
+			list_del(&page->lru);
+			if (unlikely(!page_evictable(page, NULL))) {
+				spin_unlock_irq(&zone->lru_lock);
+				putback_lru_page(page);
+				spin_lock_irq(&zone->lru_lock);
+				continue;
+			}
+			SetPageLRU(page);
+			lru = page_lru(page);
+			add_page_to_lru_list(zone, page, lru);
+			if (!pagevec_add(&pvec, page)) {
+				spin_unlock_irq(&zone->lru_lock);
+				__pagevec_release(&pvec);
+				spin_lock_irq(&zone->lru_lock);
+			}
+		}
+		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+		nr_pageout += nr_taken - nr_freed;
+		nr_cleaned += nr_freed;
+		if (nr_pageout > SWAP_CLUSTER_MAX) {
+			/* there are remaining I/Os */
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			nr_pageout /= 2;
+		}
+	} while(nr_cleaned < SWAP_CLUSTER_MAX);
+done:
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+	return;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
 					sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (current_is_kswapd())
+			shrink_cleaning_list(zone);
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2222,6 +2333,42 @@ out:
 	return sc.nr_reclaimed;
 }
 
+static void launder_pgdat(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (!zone_page_state(zone, NR_CLEANING))
+			continue;
+		shrink_cleaning_list(zone);
+	}
+}
+
+/*
+ * Find a zone which has cleaning list.
+ */
+static int need_to_cleaning_node(pg_data_t *pgdat)
+{
+	int i;
+	struct zone *zone;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+
+		zone = &pgdat->node_zones[i];
+		if (!populated_zone(zone))
+			continue;
+		if (zone_page_state(zone, NR_CLEANING))
+			break;
+	}
+	return (i != MAX_NR_ZONES);
+}
+
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -2275,7 +2422,9 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		if (need_to_cleaning_node(pgdat)) {
+			launder_pgdat(pgdat);
+		} else if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
 			 * allocation

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-29 11:34   ` Mel Gorman
@ 2010-06-30 13:05     ` Chris Mason
  -1 siblings, 0 replies; 105+ messages in thread
From: Chris Mason @ 2010-06-30 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 12:34:47PM +0100, Mel Gorman wrote:
> As only kswapd and memcg are writing back pages, there should be no
> danger of overflowing the stack. Allow the writing back of dirty pages
> in btrfs from the VM.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Signed-off-by: Chris Mason <chris.mason@oracle.com>

But, this is only the metadata writepage.  fs/btrfs/inode.c has another
one for data pages.  (just look for PF_MEMALLOC).
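
(For illustration only: the sort of PF_MEMALLOC bail-out being referred to
looks roughly like the sketch below. The function names are invented and
this is not the actual btrfs code.)

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/* Sketch: refuse to do I/O when ->writepage is called from reclaim */
static int sketch_data_writepage(struct page *page,
				 struct writeback_control *wbc)
{
	if (current->flags & PF_MEMALLOC) {
		/* called from (direct) reclaim: redirty and defer to flushers */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* normal writeback path would continue here */
	return 0;
}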

-chris

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/14] vmscan: Fix mapping use after free
  2010-06-29 14:27     ` Minchan Kim
@ 2010-07-01  9:53       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01  9:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Tue, Jun 29, 2010 at 11:27:05PM +0900, Minchan Kim wrote:
> On Tue, Jun 29, 2010 at 8:34 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Use lock_page_nosync in handle_write_error as after writepage we have no
> > reference to the mapping when taking the page lock.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Trivial.
> Please modify description of the function if you have a next turn.
> "run sleeping lock_page()" -> "run sleeping lock_page_nosync"
> 

Fixed, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages
  2010-06-30 13:05     ` Chris Mason
  (?)
@ 2010-07-01  9:55       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01  9:55 UTC (permalink / raw)
  To: Chris Mason, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jun 30, 2010 at 09:05:04AM -0400, Chris Mason wrote:
> On Tue, Jun 29, 2010 at 12:34:47PM +0100, Mel Gorman wrote:
> > As only kswapd and memcg are writing back pages, there should be no
> > danger of overflowing the stack. Allow the writing back of dirty pages
> > in btrfs from the VM.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Signed-off-by: Chris Mason <chris.mason@oracle.com>
> 
> But, this is only the metadata writepage.  fs/btrfs/inode.c has another
> one for data pages.  (just look for PF_MEMALLOC).
> 

My bad, fixed now. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-06-30  0:14         ` KAMEZAWA Hiroyuki
@ 2010-07-01 10:30           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-01 10:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Wed, Jun 30, 2010 at 09:14:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Jun 2010 13:51:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > > I don't see a patch in this set which refuses writeback from the memcg
> > > context, which we identified as having large stack footprint in hte
> > > discussion of the last patch set.
> > > 
> > 
> > It wasn't clear to me what the right approach was there and should
> > have noted that in the intro. The last note I have on it is this message
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> > avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> > pages for memcg, I left memcg being able to direct writeback to see what
> > the memcg people preferred.
> > 
> 
> Hmm. If some filesystems don't support direct ->writeback,

This is not strictly true. All of them support ->writeback, but some of
them are ignoring it from reclaim context - right now xfs, btrfs and extN -
so it's a sizable cross-section of the filesystems we care about.

> memcg shouldn't
> depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> ok.
> 
> Then, my concern here is that which kswapd we should wake up and how it can stop.

And also what the consequences are of kswapd being occupied with containers
instead of the global lists for a time.

> IOW, how kswapd can know a memcg has some remaining writeback and struck on it.
> 

Another possibility for memcg would be to revisit Andrea's suggestion on
switching stacks in more detail. I still haven't gotten around to this as
PhD work is sucking up piles of my time.

> One idea is here. (this patch will not work...not tested at all.)
> If we can have "victim page list" and kswapd can depend on it to know
> "which pages should be written", kswapd can know when it should work.
> 
> cpu usage by memcg will be a new problem...but...
> 
> ==
> Add a new LRU "CLEANING" and make kswapd launder it.
> This patch also changes PG_reclaim behavior. New PG_reclaim works
> as
>  - If PG_reclaim is set, a page is on CLEAINING LIST.
> 
> And when kswapd launder a page
>  - issue an writeback. (I'm thinking whehter I should put this
>    cleaned page back to CLEANING lru and free it later.) 
>  - if it can free directly, free it.
> This just use current shrink_list().
> 
> Maybe this patch itself inlcludes many bad point...
> 
> ---
>  fs/proc/meminfo.c         |    2 
>  include/linux/mm_inline.h |    9 ++
>  include/linux/mmzone.h    |    7 ++
>  mm/filemap.c              |    3 
>  mm/internal.h             |    1 
>  mm/page-writeback.c       |    1 
>  mm/page_io.c              |    1 
>  mm/swap.c                 |   31 ++-------
>  mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
>  9 files changed, 176 insertions(+), 32 deletions(-)
> 
> Index: mmotm-0611/include/linux/mmzone.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mmzone.h
> +++ mmotm-0611/include/linux/mmzone.h
> @@ -85,6 +85,7 @@ enum zone_stat_item {
>  	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
>  	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
> +	NR_CLEANING,		/*  "     "     "   "       "         */
>  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
>  	NR_ANON_PAGES,	/* Mapped anonymous pages */
>  	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
> @@ -133,6 +134,7 @@ enum lru_list {
>  	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
>  	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
>  	LRU_UNEVICTABLE,
> +	LRU_CLEANING,
>  	NR_LRU_LISTS
>  };
>  
> @@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
>  	return (l == LRU_UNEVICTABLE);
>  }
>  
> +static inline int is_cleaning_lru(enum lru_list l)
> +{
> +	return (l == LRU_CLEANING);
> +}
> +

Nit - LRU_CLEAN_PENDING might be clearer, as CLEANING implies it is currently
being cleaned (implying it's the same as NR_WRITEBACK) or is definitely dirty,
implying it's the same as NR_DIRTY.

>  enum zone_watermarks {
>  	WMARK_MIN,
>  	WMARK_LOW,
> Index: mmotm-0611/include/linux/mm_inline.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mm_inline.h
> +++ mmotm-0611/include/linux/mm_inline.h
> @@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
>  	enum lru_list l;
>  
>  	list_del(&page->lru);
> -	if (PageUnevictable(page)) {
> +	if (PageReclaim(page)) {
> +		ClearPageReclaim(page);
> +		l = LRU_CLEANING;
> +	} else if (PageUnevictable(page)) {
>  		__ClearPageUnevictable(page);
>  		l = LRU_UNEVICTABLE;
>  	} else {

One point of note is that having an LRU cleaning list will alter the aging
of pages quite a bit.

A slightly greater concern is that clean pages can be temporarily "lost"
on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
list, it is no longer considering those pages even if a flusher thread
happened to clean them before kswapd had a chance. Let's say that under
heavy memory pressure a lot of pages are being dirtied and encountered on
the LRU list. They move to LRU_CLEANING, where dirty balancing starts making
sure they get cleaned, but they are no longer being reclaimed.

Of course, I might be wrong but it's not a trivial direction to take.

> @@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
>  {
>  	enum lru_list lru;
>  
> -	if (PageUnevictable(page))
> +	if (PageReclaim(page)) {
> +		lru = LRU_CLEANING;
> +	} else if (PageUnevictable(page))
>  		lru = LRU_UNEVICTABLE;
>  	else {
>  		lru = page_lru_base_type(page);
> Index: mmotm-0611/fs/proc/meminfo.c
> ===================================================================
> --- mmotm-0611.orig/fs/proc/meminfo.c
> +++ mmotm-0611/fs/proc/meminfo.c
> @@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
>  		"Active(file):   %8lu kB\n"
>  		"Inactive(file): %8lu kB\n"
>  		"Unevictable:    %8lu kB\n"
> +		"Cleaning:       %8lu kB\n"
>  		"Mlocked:        %8lu kB\n"
>  #ifdef CONFIG_HIGHMEM
>  		"HighTotal:      %8lu kB\n"
> @@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
>  		K(pages[LRU_ACTIVE_FILE]),
>  		K(pages[LRU_INACTIVE_FILE]),
>  		K(pages[LRU_UNEVICTABLE]),
> +		K(pages[LRU_CLEANING]),
>  		K(global_page_state(NR_MLOCK)),
>  #ifdef CONFIG_HIGHMEM
>  		K(i.totalhigh),
> Index: mmotm-0611/mm/swap.c
> ===================================================================
> --- mmotm-0611.orig/mm/swap.c
> +++ mmotm-0611/mm/swap.c
> @@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
>  			zone = pagezone;
>  			spin_lock(&zone->lru_lock);
>  		}
> -		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -			int lru = page_lru_base_type(page);
> +		if (PageLRU(page)) {
> +			int lru = page_lru(page);
>  			list_move_tail(&page->lru, &zone->lru[lru].list);
>  			pgmoved++;
>  		}
> @@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
>  	pagevec_reinit(pvec);
>  }
>  
> -/*
> - * Writeback is about to end against a page which has been marked for immediate
> - * reclaim.  If it still appears to be reclaimable, move it to the tail of the
> - * inactive list.
> - */
> -void  rotate_reclaimable_page(struct page *page)
> -{
> -	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> -	    !PageUnevictable(page) && PageLRU(page)) {
> -		struct pagevec *pvec;
> -		unsigned long flags;
> -
> -		page_cache_get(page);
> -		local_irq_save(flags);
> -		pvec = &__get_cpu_var(lru_rotate_pvecs);
> -		if (!pagevec_add(pvec, page))
> -			pagevec_move_tail(pvec);
> -		local_irq_restore(flags);
> -	}
> -}
> -
>  static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>  				     int file, int rotated)
>  {
> @@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
>  {
>  	if (PageActive(page)) {
>  		VM_BUG_ON(PageUnevictable(page));
> +		VM_BUG_ON(PageReclaim(page));
>  		ClearPageActive(page);
>  	} else if (PageUnevictable(page)) {
>  		VM_BUG_ON(PageActive(page));
> +		VM_BUG_ON(PageReclaim(page));
>  		ClearPageUnevictable(page);
> +	} else if (PageReclaim(page)) {
> +		VM_BUG_ON(PageReclaim(page));
> +		VM_BUG_ON(PageUnevictable(page));
> +		ClearPageReclaim(page);
>  	}
>  
>  	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
> Index: mmotm-0611/mm/filemap.c
> ===================================================================
> --- mmotm-0611.orig/mm/filemap.c
> +++ mmotm-0611/mm/filemap.c
> @@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
>   */
>  void end_page_writeback(struct page *page)
>  {
> -	if (TestClearPageReclaim(page))
> -		rotate_reclaimable_page(page);
> -
>  	if (!test_clear_page_writeback(page))
>  		BUG();
>  
> Index: mmotm-0611/mm/internal.h
> ===================================================================
> --- mmotm-0611.orig/mm/internal.h
> +++ mmotm-0611/mm/internal.h
> @@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
>  extern u64 hwpoison_filter_flags_value;
>  extern u64 hwpoison_filter_memcg;
>  extern u32 hwpoison_filter_enable;
> +
> Index: mmotm-0611/mm/page-writeback.c
> ===================================================================
> --- mmotm-0611.orig/mm/page-writeback.c
> +++ mmotm-0611/mm/page-writeback.c
> @@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page 
>  
>  	BUG_ON(!PageLocked(page));
>  
> -	ClearPageReclaim(page);
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		/*
>  		 * Yes, Virginia, this is indeed insane.
> Index: mmotm-0611/mm/page_io.c
> ===================================================================
> --- mmotm-0611.orig/mm/page_io.c
> +++ mmotm-0611/mm/page_io.c
> @@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
>  				imajor(bio->bi_bdev->bd_inode),
>  				iminor(bio->bi_bdev->bd_inode),
>  				(unsigned long long)bio->bi_sector);
> -		ClearPageReclaim(page);
>  	}
>  	end_page_writeback(page);
>  	bio_put(bio);
> Index: mmotm-0611/mm/vmscan.c
> ===================================================================
> --- mmotm-0611.orig/mm/vmscan.c
> +++ mmotm-0611/mm/vmscan.c
> @@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
>  	if (!may_write_to_queue(mapping->backing_dev_info))
>  		return PAGE_KEEP;
>  
> +	if (!current_is_kswapd()) {
> +		/* pass this page to kswapd. */
> +		SetPageReclaim(page);
> +		return PAGE_KEEP;
> +	}
> +
>  	if (clear_page_dirty_for_io(page)) {
>  		int res;
>  		struct writeback_control wbc = {
> @@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
>  
>  redo:
>  	ClearPageUnevictable(page);
> +	/* This function never puts pages to CLEANING queue */
> +	ClearPageReclaim(page);
>  
>  	if (page_evictable(page, NULL)) {
>  		/*
> @@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
>  		 * page release code relies on it.
>  		 */
>  		ClearPageLRU(page);
> +		/* when someone isolate this page, clear reclaim status */
> +		ClearPageReclaim(page);
>  		ret = 0;
>  	}
>  
> @@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
>   * # of pages of each types and clearing any active bits.
>   */
>  static unsigned long count_page_types(struct list_head *page_list,
> -				unsigned int *count, int clear_active)
> +				unsigned int *count, int clear_actives)
>  {
>  	int nr_active = 0;
>  	int lru;
> @@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
>  			int lru = page_lru(page);
>  			ret = 0;
>  			ClearPageLRU(page);
> +			ClearPageReclaim(page);
>  
>  			del_page_from_lru_list(zone, page, lru);
>  		}
> @@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
>  	return isolated > inactive;
>  }
>  
> +/* only called by kswapd to do I/O and put back clean pages to its LRU */
> +static void shrink_cleaning_list(struct zone *zone)
> +{
> +	LIST_HEAD(page_list);
> +	struct list_head *src;
> +	struct pagevec pvec;
> +	unsigned long nr_pageout;
> +	unsigned long nr_cleaned;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.nr_to_reclaim = ULONG_MAX,
> +		.swappiness = vm_swappiness,
> +		.order = 1,
> +		.mem_cgroup = NULL,
> +	};
> +
> +	pagevec_init(&pvec, 1);
> +	lru_add_drain();
> +
> +	src = &zone->lru[LRU_CLEANING].list;
> +	nr_pageout = 0;
> +	nr_cleaned = 0;
> +	spin_lock_irq(&zone->lru_lock);
> +	do {
> +		unsigned int count[NR_LRU_LISTS] = {0,};
> +		unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
> +		unsigned long nr_scan;
> +
> +		if (list_empty(src))
> +			goto done;
> +
> +		check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
> +				zone_page_state(zone, NR_CLEANING)/8);
> +		/* we do global-only */
> +		nr_taken = isolate_lru_pages(check_clean,
> +					src, &page_list, &nr_scan,
> +					0, ISOLATE_BOTH, 0);
> +		zone->pages_scanned += nr_scan;
> +		__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
> +		if (nr_taken == 0)
> +			goto done;
> +		__mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
> +		spin_unlock_irq(&zone->lru_lock);
> +		/*
> +		 * Because PG_reclaim flag is deleted by isolate_lru_page(),
> +		 * we can count correct value
> +		 */
> +		count_page_types(&page_list, count, 0);
> +		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> +
> +		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);

So, at this point the isolated pages are cleaned and put back, which is
fine. If they were already clean, they get freed, which is also fine. But
direct reclaimers do not call this function, so they could be missing
clean and freeable pages, which worries me.

> +		/*
> +		 * Put back any unfreeable pages.
> +		 */
> +		while (!list_empty(&page_list)) {
> +			int lru;
> +			struct page *page;
> +
> +			page = lru_to_page(&page_list);
> +			VM_BUG_ON(PageLRU(page));
> +			list_del(&page->lru);
> +			if (unlikely(!page_evictable(page, NULL))) {
> +				spin_unlock_irq(&zone->lru_lock);
> +				putback_lru_page(page);
> +				spin_lock_irq(&zone->lru_lock);
> +				continue;
> +			}
> +			SetPageLRU(page);
> +			lru = page_lru(page);
> +			add_page_to_lru_list(zone, page, lru);
> +			if (!pagevec_add(&pvec, page)) {
> +				spin_unlock_irq(&zone->lru_lock);
> +				__pagevec_release(&pvec);
> +				spin_lock_irq(&zone->lru_lock);
> +			}
> +		}
> +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> +		nr_pageout += nr_taken - nr_freed;
> +		nr_cleaned += nr_freed;
> +		if (nr_pageout > SWAP_CLUSTER_MAX) {
> +			/* there are remaining I/Os */
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			nr_pageout /= 2;
> +		}
> +	} while(nr_cleaned < SWAP_CLUSTER_MAX);
> +done:
> +	spin_unlock_irq(&zone->lru_lock);
> +	pagevec_release(&pvec);
> +	return;
> +}
> +
>  /*
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
> @@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
>  					sc->nodemask) {
>  		if (!populated_zone(zone))
>  			continue;
> +
> +		if (current_is_kswapd())
> +			shrink_cleaning_list(zone);
>  		/*
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
> @@ -2222,6 +2333,42 @@ out:
>  	return sc.nr_reclaimed;
>  }
>  
> +static void launder_pgdat(pg_data_t *pgdat)
> +{
> +	struct zone *zone;
> +	int i;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> +		zone = &pgdat->node_zones[i];
> +		if (!populated_zone(zone))
> +			continue;
> +		if (!zone_page_state(zone, NR_CLEANING))
> +			continue;
> +		shrink_cleaning_list(zone);
> +	}
> +}
> +
> +/*
> + * Find a zone which has cleaning list.
> + */
> +static int need_to_cleaning_node(pg_data_t *pgdat)
> +{
> +	int i;
> +	struct zone *zone;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> +		zone = &pgdat->node_zones[i];
> +		if (!populated_zone(zone))
> +			continue;
> +		if (zone_page_state(zone, NR_CLEANING))
> +			break;
> +	}
> +	return (i != MAX_NR_ZONES);
> +}
> +
> +
>  /*
>   * The background pageout daemon, started as a kernel thread
>   * from the init process.
> @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
>  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  		new_order = pgdat->kswapd_max_order;
>  		pgdat->kswapd_max_order = 0;
> -		if (order < new_order) {
> +		if (need_to_cleaning_node(pgdat)) {
> +			launder_pgdat(pgdat);
> +		} else if (order < new_order) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'
>  			 * allocation

I see the direction you are thinking of, but I have big concerns about clean
pages getting delayed for too long on the LRU_CLEANING list before kswapd
puts them back in the right place. I think a safer direction would be for
the memcg people to investigate Andrea's "switch stack" suggestion.

In the meantime, for my own series, memcg now treats dirty pages similarly to
lumpy reclaim. It asks flusher threads to clean pages but stalls for a time
waiting for those pages to be cleaned. This is an untested patch on top
of the current series.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5c4f08b..81c6fbe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -367,10 +367,10 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
 	return PAGE_SUCCESS;
 }
 
-/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+/* For now, only kswapd can writeback as it will not overflow stack */
 static inline bool reclaim_can_writeback(struct scan_control *sc)
 {
-	return current_is_kswapd() || sc->mem_cgroup != NULL;
+	return current_is_kswapd();
 }
 
 /*
@@ -900,10 +900,11 @@ keep_dirty:
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * As lumpy reclaim targets specific pages, wait on them
-		 * to be cleaned and try reclaim again for a time.
+		 * As lumpy reclaim and memcg targets specific pages, wait on
+		 * them to be cleaned and try reclaim again.
 		 */
-		if (sync_writeback == PAGEOUT_IO_SYNC) {
+		if (sync_writeback == PAGEOUT_IO_SYNC ||
+							sc->mem_cgroup != NULL) {
 			dirty_isolated++;
 			list_splice(&dirty_pages, page_list);
 			INIT_LIST_HEAD(&dirty_pages);

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-01 10:30           ` Mel Gorman
@ 2010-07-02  6:26             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-02  6:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Thu, 1 Jul 2010 11:30:32 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> > memcg shouldn't
> > depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> > ok.
> > 
> > Then, my concern here is that which kswapd we should wake up and how it can stop.
> 
> And also what the consequences are of kswapd being occupied with containers
> instead of the global lists for a time.
> 
Yes, we may have to add a thread or a workqueue for memcg to isolate that workload.
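
(A very rough sketch of that direction, purely for illustration; the work
item, its callback, and what it would actually do are all invented here.)

#include <linux/workqueue.h>

/* Hypothetical per-memcg laundering work: write back / reclaim pages
 * charged to a memcg from process context instead of direct reclaim. */
static void memcg_launder_workfn(struct work_struct *work)
{
	/* ... launder the memcg's dirty pages here ... */
}

static DECLARE_WORK(memcg_launder_work, memcg_launder_workfn);

/* Called when a memcg hits its limit and finds dirty pages */
static void memcg_kick_launder(void)
{
	schedule_work(&memcg_launder_work);
}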


> > IOW, how kswapd can know a memcg has some remaining writeback and struck on it.
> > 
> 
> Another possibility for memcg would be to visit Andrea's suggestion on
> switching stack in more detail. I still haven't gotten around to this as
> phd stuff is sucking up piles of my time.

Sure.

> > One idea is here. (this patch will not work...not tested at all.)
> > If we can have "victim page list" and kswapd can depend on it to know
> > "which pages should be written", kswapd can know when it should work.
> > 
> > cpu usage by memcg will be a new problem...but...
> > 
> > ==
> > Add a new LRU "CLEANING" and make kswapd launder it.
> > This patch also changes PG_reclaim behavior. New PG_reclaim works
> > as
> >  - If PG_reclaim is set, a page is on CLEAINING LIST.
> > 
> > And when kswapd launder a page
> >  - issue an writeback. (I'm thinking whehter I should put this
> >    cleaned page back to CLEANING lru and free it later.) 
> >  - if it can free directly, free it.
> > This just use current shrink_list().
> > 
> > Maybe this patch itself inlcludes many bad point...
> > 
> > ---
> >  fs/proc/meminfo.c         |    2 
> >  include/linux/mm_inline.h |    9 ++
> >  include/linux/mmzone.h    |    7 ++
> >  mm/filemap.c              |    3 
> >  mm/internal.h             |    1 
> >  mm/page-writeback.c       |    1 
> >  mm/page_io.c              |    1 
> >  mm/swap.c                 |   31 ++-------
> >  mm/vmscan.c               |  153 +++++++++++++++++++++++++++++++++++++++++++++-
> >  9 files changed, 176 insertions(+), 32 deletions(-)
> > 
> > Index: mmotm-0611/include/linux/mmzone.h
> > ===================================================================
> > --- mmotm-0611.orig/include/linux/mmzone.h
> > +++ mmotm-0611/include/linux/mmzone.h
> > @@ -85,6 +85,7 @@ enum zone_stat_item {
> >  	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
> >  	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
> >  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
> > +	NR_CLEANING,		/*  "     "     "   "       "         */
> >  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
> >  	NR_ANON_PAGES,	/* Mapped anonymous pages */
> >  	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
> > @@ -133,6 +134,7 @@ enum lru_list {
> >  	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> >  	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> >  	LRU_UNEVICTABLE,
> > +	LRU_CLEANING,

 
> > +static inline int is_cleaning_lru(enum lru_list l)
> > +{
> > +	return (l == LRU_CLEANING);
> > +}
> > +
> 
> Nit - LRU_CLEAN_PENDING might be clearer as CLEANING implies it is currently
> being cleaned (implying it's the same as NR_WRITEBACK) or is definely dirty
> implying it's the same as NR_DIRTY.
> 
ok.

> >  enum zone_watermarks {
> >  	WMARK_MIN,
> >  	WMARK_LOW,
> > Index: mmotm-0611/include/linux/mm_inline.h
> > ===================================================================
> > --- mmotm-0611.orig/include/linux/mm_inline.h
> > +++ mmotm-0611/include/linux/mm_inline.h
> > @@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
> >  	enum lru_list l;
> >  
> >  	list_del(&page->lru);
> > -	if (PageUnevictable(page)) {
> > +	if (PageReclaim(page)) {
> > +		ClearPageReclaim(page);
> > +		l = LRU_CLEANING;
> > +	} else if (PageUnevictable(page)) {
> >  		__ClearPageUnevictable(page);
> >  		l = LRU_UNEVICTABLE;
> >  	} else {
> 
> One point of note is that having a LRU cleaning list will alter the aging
> of pages quite a bit.
> 
yes.

> A slightly greater concern is that clean pages can be temporarily "lost"
> on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> list, it's no longer considering those pages even if a flusher thread
> happened to clean those pages before kswapd had a chance. Lets say under
> heavy memory pressure a lot of pages are being dirties and encountered on
> the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> sure they get cleaned but are no longer being reclaimed.
> 
> Of course, I might be wrong but it's not a trivial direction to take.
> 

I hope dirty_ratio et al. may help us. But I agree this "hiding" can cause
issues.
IIRC, someone wrote a patch to prevent too many threads from entering vmscan..
that kind of work may be necessary.




> > +/* only called by kswapd to do I/O and put back clean paes to its LRU */
> > +static void shrink_cleaning_list(struct zone *zone)
> > +{

> > +		count_page_types(&page_list, count, 0);
> > +		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > +		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > +		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> > +		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> > +
> > +		nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
> 
> So, at this point the isolated pages are cleaned and put back which is
> fine. If they were already clean, they get freed which is also fine. But
> direct reclaimers do not call this function so they could be missing
> clean and freeable pages which worries me.
> 

Hmm. I have to be afraid of that...my first thought was adding klaunderd
and a waitqueue between klaunderd and direct reclaimers.
I used kswapd to keep the whole thing simple, but I wonder whether we need some
waitq for the case where we're afraid that all pages are under I/O.
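
A minimal sketch of that klaunderd/waitqueue hand-off, purely for illustration
(all names are hypothetical and the actual laundering of the queued pages is
elided):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/kthread.h>

static LIST_HEAD(launder_list);
static DEFINE_SPINLOCK(launder_lock);
static DECLARE_WAIT_QUEUE_HEAD(klaunderd_wait);	/* klaunderd sleeps here */
static DECLARE_WAIT_QUEUE_HEAD(launder_done);	/* direct reclaimers sleep here */

/* Direct reclaimer: hand over dirty pages and wait for a laundering pass. */
static void launder_request_and_wait(struct list_head *dirty_pages)
{
	spin_lock(&launder_lock);
	list_splice_tail_init(dirty_pages, &launder_list);
	spin_unlock(&launder_lock);

	wake_up(&klaunderd_wait);

	/* Bounded wait so a stuck device cannot hang direct reclaim forever. */
	wait_event_timeout(launder_done, list_empty(&launder_list), HZ / 10);
}

/* klaunderd main loop; issuing the actual writeback is omitted. */
static int klaunderd(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(klaunderd_wait,
					 !list_empty(&launder_list) ||
					 kthread_should_stop());

		/* ... pull pages off launder_list and start writeback ... */

		wake_up_all(&launder_done);
	}
	return 0;
}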


> > +		/*
> > +		 * Put back any unfreeable pages.
> > +		 */

> >  /*
> >   * The background pageout daemon, started as a kernel thread
> >   * from the init process.
> > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  		new_order = pgdat->kswapd_max_order;
> >  		pgdat->kswapd_max_order = 0;
> > -		if (order < new_order) {
> > +		if (need_to_cleaning_node(pgdat)) {
> > +			launder_pgdat(pgdat);
> > +		} else if (order < new_order) {
> >  			/*
> >  			 * Don't sleep if someone wants a larger 'order'
> >  			 * allocation
> 
> I see the direction you are thinking of but I have big concerns about clean
> pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> puts them back in the right place. I think a safer direction would be for
> memcg people to investigate Andrea's "switch stack" suggestion.
> 
Hmm, I may have to consider that. My concern is that the IRQ switch-stack works
well only because there is no task switch in the IRQ routine. (I'm sorry if I
misunderstand.)

One possibility for memcg would be to limit the number of reclaimers that can
use __GFP_FS and to use a shared stack per cpu per memcg.
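
A crude sketch of the limiting part of that idea (the shared-stack side is not
shown; the gate structure and the limit are hypothetical, and struct mem_cgroup
is treated as opaque):

#include <linux/types.h>
#include <linux/semaphore.h>
#include <linux/gfp.h>

#define MEMCG_MAX_FS_RECLAIMERS	4	/* arbitrary illustrative limit */

struct memcg_reclaim_gate {
	/* initialised with sema_init(&gate->fs_slots, MEMCG_MAX_FS_RECLAIMERS) */
	struct semaphore fs_slots;
};

/* May this task do __GFP_FS writeback for the memcg right now? */
static bool memcg_may_writeback(struct memcg_reclaim_gate *gate, gfp_t gfp_mask)
{
	if (!(gfp_mask & __GFP_FS))
		return false;
	/* Never sleep here; just try to grab one of the limited slots. */
	return down_trylock(&gate->fs_slots) == 0;
}

/* Release the slot once writeback on behalf of the memcg is done. */
static void memcg_end_writeback(struct memcg_reclaim_gate *gate)
{
	up(&gate->fs_slots);
}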

Hmm. Yet another per-memcg memory shrinker may sound good. Two years ago, I wrote
a patch to do a high/low-watermark memory shrinker thread for memcg.
  
  - limit
  - high
  - low

Start memory reclaim/writeback when usage exceeds "high" and stop when it is below
"low". Implementing this with a thread pool could be an option.



> In the meantime for my own series, memcg now treats dirty pages similar to
> lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> for those pages to be cleaned for a time. This is an untested patch on top
> of the current series.
> 

Wow...doesn't this make memcg too slow? Anyway, memcg should kick the flusher
threads..or something; other work is needed, too.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-02  6:26             ` KAMEZAWA Hiroyuki
@ 2010-07-02  6:31               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-02  6:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Fri, 2 Jul 2010 15:26:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > I see the direction you are thinking of but I have big concerns about clean
> > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > puts them back in the right place. I think a safer direction would be for
> > memcg people to investigate Andrea's "switch stack" suggestion.
> > 
> Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> 
Ok, I'll think about this 1st.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
  2010-06-29 11:34 ` Mel Gorman
@ 2010-07-02 19:33   ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2010-07-02 19:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Tue, 29 Jun 2010 12:34:34 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> Here is V3 that depends again on flusher threads to do writeback in
> direct reclaim rather than stack switching which is not something I'm
> likely to get done before xfs/btrfs are ignoring writeback in mainline
> (phd sucking up time).

IMO, implementing stack switching for this is not a good idea.  We
_already_ have a way of doing stack-switching.  It's called
"schedule()".

The only reason I can see for implementing an in-place stack switch
would be if schedule() is too expensive.  And if we were to see
excessive context-switch overheads in this code path (and we won't)
then we should get in there and try to reduce the context switch rate
first.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-06-29 11:34   ` Mel Gorman
@ 2010-07-02 19:51     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2010-07-02 19:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Tue, 29 Jun 2010 12:34:46 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back pages by not setting
> may_writepage in scan_control. Instead, dirty pages are placed back on the
> LRU lists for either background writing by the BDI threads or kswapd. If
> in direct lumpy reclaim and dirty pages are encountered, the process will
> stall for the background flusher before trying to reclaim the pages again.
> 
> Memory control groups do not have a kswapd-like thread nor do pages get
> direct reclaimed from the page allocator. Instead, memory control group
> pages are reclaimed when the quota is being exceeded or the group is being
> shrunk. As it is not expected that the entry points into page reclaim are
> deep call chains memcg is still allowed to writeback dirty pages.

I already had "[PATCH 01/14] vmscan: Fix mapping use after free" and
I'll send that in for 2.6.35.

I grabbed [02/14] up to [11/14].  Including "[PATCH 06/14] vmscan: kill
prev_priority completely", grumpyouallsuck.

I wimped out at this, "Do not writeback pages in direct reclaim".  It
really is a profound change and needs a bit more thought, discussion
and, if possible, testing designed to explore possible pathologies.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 0/14] Avoid overflowing of stack during page reclaim V3
  2010-07-02 19:33   ` Andrew Morton
@ 2010-07-05  1:35     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-05  1:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, KOSAKI Motohiro, Andrea Arcangeli

On Fri, 2 Jul 2010 12:33:15 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 29 Jun 2010 12:34:34 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Here is V3 that depends again on flusher threads to do writeback in
> > direct reclaim rather than stack switching which is not something I'm
> > likely to get done before xfs/btrfs are ignoring writeback in mainline
> > (phd sucking up time).
> 
> IMO, implemetning stack switching for this is not a good idea.  We
> _already_ have a way of doing stack-switching.  It's called
> "schedule()".
> 
Sure. 

> The only reason I can see for implementing an in-place stack switch
> would be if schedule() is too expensive.  And if we were to see
> excessive context-switch overheads in this code path (and we won't)
> then we should get in there and try to reduce the contect switch rate
> first.
> 

Maybe the concern of in-place stack-switch proponents is that it's difficult
to guarantee when the pageout() will be issued.

I'd like to try adding calls such as:

 - pageout_request(page) .... request that a daemon (kswapd) page out a page.
 - pageout_barrier(zone? node?) .... wait until all writebacks end.

Implementation dilemma:

page->lru is a very useful link for implementing calls like the above, but
there is a concern that using it will hide pages from vmscan unnecessarily.
Avoiding the use of page->lru means using another structure like a pagevec,
but that means page_count()+1 and pins pages unnecessarily. I'm now considering
how to implement a safe and scalable way to pageout-in-another-stack(thread)...
I suspect it will require some throttling method for pageout, anyway.
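
A rough sketch of those two calls using the page->lru variant (hypothetical code;
it assumes the caller has already isolated the page from its LRU, and per-zone/node
bookkeeping is elided):

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

static LIST_HEAD(pageout_queue);
static DEFINE_SPINLOCK(pageout_lock);
static DECLARE_WAIT_QUEUE_HEAD(pageout_waitq);
static unsigned long pageout_pending;

/* Queue an already-isolated dirty page for the pageout daemon. */
static void pageout_request(struct page *page)
{
	spin_lock_irq(&pageout_lock);
	list_add_tail(&page->lru, &pageout_queue);
	pageout_pending++;
	spin_unlock_irq(&pageout_lock);
	/* ... wake the daemon (kswapd or a dedicated thread) here ... */
}

/* Called by the daemon once writeback of a queued page has been issued. */
static void pageout_issued(struct page *page)
{
	spin_lock_irq(&pageout_lock);
	if (!--pageout_pending)
		wake_up_all(&pageout_waitq);
	spin_unlock_irq(&pageout_lock);
}

/* Wait until everything queued so far has been submitted for writeback. */
static void pageout_barrier(void)
{
	wait_event(pageout_waitq, pageout_pending == 0);
}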

And my own problem is that I would need to add per-memcg threads or use some
thread pool ;( But that is another topic.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-02 19:51     ` Andrew Morton
@ 2010-07-05 13:49       ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-05 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli

On Fri, Jul 02, 2010 at 12:51:55PM -0700, Andrew Morton wrote:
> On Tue, 29 Jun 2010 12:34:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back pages by not setting
> > may_writepage in scan_control. Instead, dirty pages are placed back on the
> > LRU lists for either background writing by the BDI threads or kswapd. If
> > in direct lumpy reclaim and dirty pages are encountered, the process will
> > stall for the background flusher before trying to reclaim the pages again.
> > 
> > Memory control groups do not have a kswapd-like thread nor do pages get
> > direct reclaimed from the page allocator. Instead, memory control group
> > pages are reclaimed when the quota is being exceeded or the group is being
> > shrunk. As it is not expected that the entry points into page reclaim are
> > deep call chains memcg is still allowed to writeback dirty pages.
> 
> I already had "[PATCH 01/14] vmscan: Fix mapping use after free" and
> I'll send that in for 2.6.35.
> 

Perfect, thanks.

> I grabbed [02/14] up to [11/14].  Including "[PATCH 06/14] vmscan: kill
> prev_priority completely", grumpyouallsuck.
> 
> I wimped out at this, "Do not writeback pages in direct reclaim".  It
> really is a profound change and needs a bit more thought, discussion
> and if possible testing which is designed to explore possible pathologies.
> 

Ok, that's reasonable as I'm still working on that patch. For example, the
patch disables anonymous page writeback, which is unnecessary as the stack
usage for anon writeback is less than for file writeback. Second, using systemtap,
I was able to see that file-backed dirty pages have a tendency to be near the
end of the LRU even though they are a small percentage of the overall pages
in the LRU. I'm hoping to figure out why that is, as it would make avoiding
writeback a lot less controversial.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-02  6:26             ` KAMEZAWA Hiroyuki
@ 2010-07-05 14:16               ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-05 14:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Fri, Jul 02, 2010 at 03:26:43PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Jul 2010 11:30:32 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> > > memcg shouldn't
> > > depends on it. If so, memcg should depends on some writeback-thread (as kswapd).
> > > ok.
> > > 
> > > Then, my concern here is that which kswapd we should wake up and how it can stop.
> > 
> > And also what the consequences are of kswapd being occupied with containers
> > instead of the global lists for a time.
> > 
>
> yes, we may have to add a thread or workqueue for memcg for isolating workloads.
> 

Possibly, and the closer it is to kswapd behaviour the better, I would
imagine, but I must warn that I do not have much familiarity with the
behaviour of large numbers of memcgs entering reclaim.

> > A slightly greater concern is that clean pages can be temporarily "lost"
> > on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> > list, it's no longer considering those pages even if a flusher thread
> > happened to clean those pages before kswapd had a chance. Lets say under
> > heavy memory pressure a lot of pages are being dirties and encountered on
> > the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> > sure they get cleaned but are no longer being reclaimed.
> > 
> > Of course, I might be wrong but it's not a trivial direction to take.
> > 
> 
> I hope dirty_ratio at el may help us. But I agree this "hiding" can cause
> issue.
> IIRC, someone wrote a patch to prevent too many threads enter vmscan..
> such kinds of work may be necessary.
> 

Using systemtap, I have found, in global reclaim at least, that the ratio of
dirty to clean pages is not a problem. What does appear to be a problem is
that dirty pages are getting to the end of the inactive file list while
still dirty, but I haven't formulated a theory as to why yet - maybe it's
because the dirty balancing is cleaning new pages first?  Right now, I
believe dirty_ratio is working as expected but old dirty pages are a problem.

> > > <SNIP>
> > > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> > >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > >  		new_order = pgdat->kswapd_max_order;
> > >  		pgdat->kswapd_max_order = 0;
> > > -		if (order < new_order) {
> > > +		if (need_to_cleaning_node(pgdat)) {
> > > +			launder_pgdat(pgdat);
> > > +		} else if (order < new_order) {
> > >  			/*
> > >  			 * Don't sleep if someone wants a larger 'order'
> > >  			 * allocation
> > 
> > I see the direction you are thinking of but I have big concerns about clean
> > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > puts them back in the right place. I think a safer direction would be for
> > memcg people to investigate Andrea's "switch stack" suggestion.
> > 
>
> Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> 
> One possibility for memcg will be limit the number of reclaimers who can use
> __GFP_FS and use shared stack per cpu per memcg.
> 
> Hmm. yet another per-memcg memory shrinker may sound good. 2 years ago, I wrote
> a patch to do high-low-watermark memory shirker thread for memcg.
>   
>   - limit
>   - high
>   - low
> 
> start memory reclaim/writeback when usage exceeds "high" and stop it is below
> "low". Implementing this with thread pool can be a choice.
> 

Indeed, maybe something like a kswapd-memcg thread that is shared between
a configurable number of containers?

> 
> > In the meantime for my own series, memcg now treats dirty pages similar to
> > lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> > for those pages to be cleaned for a time. This is an untested patch on top
> > of the current series.
> > 
> 
> Wow...Doesn't this make memcg too slow ?

It depends heavily on how often dirty pages are being written back by direct
reclaim. It's not ideal but stalling briefly is better than crashing.
Ideally, the number of dirty pages encountered by direct reclaim would
be so small that it wouldn't matter so I'm looking into that.

> Anyway, memcg should kick flusher
> threads..or something, needs other works, too.
> 

With this patch, the flusher threads get kicked when direct reclaim encounters
pages it cannot clean.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-05 13:49       ` Mel Gorman
@ 2010-07-06  0:36         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06  0:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

Hello,

> Ok, that's reasonable as I'm still working on that patch. For example, the
> patch disabled anonymous page writeback which is unnecessary as the stack
> usage for anon writeback is less than file writeback. 

How do we examine swap-on-file?

> Second, using systemtap,
> I was able to see that file-backed dirty pages have a tendency to be near the
> end of the LRU even though they are a small percentage of the overall pages
> in the LRU. I'm hoping to figure out why this is as it would make avoiding
> writeback a lot less controversial.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/14] fs,xfs: Allow kswapd to writeback pages
  2010-07-05 14:16               ` Mel Gorman
@ 2010-07-06  0:45                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 105+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-06  0:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	Andrea Arcangeli

On Mon, 5 Jul 2010 15:16:40 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > > A slightly greater concern is that clean pages can be temporarily "lost"
> > > on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
> > > list, it's no longer considering those pages even if a flusher thread
> > > happened to clean those pages before kswapd had a chance. Lets say under
> > > heavy memory pressure a lot of pages are being dirties and encountered on
> > > the LRU list. They move to LRU_CLEANING where dirty balancing starts making
> > > sure they get cleaned but are no longer being reclaimed.
> > > 
> > > Of course, I might be wrong but it's not a trivial direction to take.
> > > 
> > 
> > I hope dirty_ratio at el may help us. But I agree this "hiding" can cause
> > issue.
> > IIRC, someone wrote a patch to prevent too many threads enter vmscan..
> > such kinds of work may be necessary.
> > 
> 
> Using systemtap, I have found in global reclaim at least that the ratio of
> dirty to clean pages is not a problem. What does appear to be a problem is
> that dirty pages are getting to the end of the inactive file list while
> still dirty but I haven't formulated a theory as to why yet - maybe it's
> because the dirty balancing is cleaning new pages first?  Right now, I
> believe dirty_ratio is working as expected but old dirty pages is a problem.
> 

Hmm. IIUC, dirty pages put back to the tail of the LRU will be moved to the head
when writeback finishes if PG_reclaim is set. This is presumably for finding clean
pages in the next vmscan.
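
For context, that rotation happens when writeback completes; roughly, this is how
end_page_writeback() looks in this era (reproduced from memory and lightly trimmed,
so treat the details as approximate; wake_up_page() is a helper local to mm/filemap.c):

void end_page_writeback(struct page *page)
{
	/*
	 * A page tagged PG_reclaim is rotated to the reclaim end of its
	 * LRU so the next scan finds it, now clean, straight away.
	 */
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);

	if (!test_clear_page_writeback(page))
		BUG();

	/* Make the flag change visible, then wake wait_on_page_writeback(). */
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}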


> > > > <SNIP>
> > > > @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> > > >  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > > >  		new_order = pgdat->kswapd_max_order;
> > > >  		pgdat->kswapd_max_order = 0;
> > > > -		if (order < new_order) {
> > > > +		if (need_to_cleaning_node(pgdat)) {
> > > > +			launder_pgdat(pgdat);
> > > > +		} else if (order < new_order) {
> > > >  			/*
> > > >  			 * Don't sleep if someone wants a larger 'order'
> > > >  			 * allocation
> > > 
> > > I see the direction you are thinking of but I have big concerns about clean
> > > pages getting delayed for too long on the LRU_CLEANING pages before kswapd
> > > puts them back in the right place. I think a safer direction would be for
> > > memcg people to investigate Andrea's "switch stack" suggestion.
> > > 
> >
> > Hmm, I may have to consider that. My concern is that IRQ's switch-stack works
> > well just because no-task-switch in IRQ routine. (I'm sorry if I misunderstand.)
> > 
> > One possibility for memcg will be limit the number of reclaimers who can use
> > __GFP_FS and use shared stack per cpu per memcg.
> > 
> > Hmm. yet another per-memcg memory shrinker may sound good. 2 years ago, I wrote
> > a patch to do high-low-watermark memory shirker thread for memcg.
> >   
> >   - limit
> >   - high
> >   - low
> > 
> > start memory reclaim/writeback when usage exceeds "high" and stop it is below
> > "low". Implementing this with thread pool can be a choice.
> > 
> 
> Indeed, maybe something like a kswapd-memcg thread that is shared between
> a configurable number of containers?
> 
Yes, I am considering that style. I would like something with automatic
configuration, but people may want knobs.



> > 
> > > In the meantime for my own series, memcg now treats dirty pages similar to
> > > lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
> > > for those pages to be cleaned for a time. This is an untested patch on top
> > > of the current series.
> > > 
> > 
> > Wow...Doesn't this make memcg too slow ?
> 
> It depends heavily on how often dirty pages are being written back by direct
> reclaim. It's not ideal but stalling briefly is better than crashing.
> Ideally, the number of dirty pages encountered by direct reclaim would
> be so small that it wouldn't matter so I'm looking into that.
> 
ok.

> > Anyway, memcg should kick flusher
> > threads..or something, needs other works, too.
> > 
> 
> With this patch, the flusher threads get kicked when direct reclaim encounters
> pages it cannot clean.
> 
Ah, I missed that. thanks.

-Kame



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  0:36         ` KOSAKI Motohiro
@ 2010-07-06  5:46           ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06  5:46 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hello,
>
>> Ok, that's reasonable as I'm still working on that patch. For example, the
>> patch disabled anonymous page writeback which is unnecessary as the stack
>> usage for anon writeback is less than file writeback.
>
> How do we examine swap-on-file?

bool is_swap_on_file(struct page *page)
{
    struct swap_info_struct *p;
    swp_entry_t entry;

    /* true if this swap page lives in a swapfile rather than
     * directly on a block device */
    entry.val = page_private(page);
    p = swap_info_get(entry);
    return !(p->flags & SWP_BLKDEV);
}

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  5:46           ` Minchan Kim
@ 2010-07-06  6:02             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06  6:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

> On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Hello,
> >
> >> Ok, that's reasonable as I'm still working on that patch. For example, the
> >> patch disabled anonymous page writeback which is unnecessary as the stack
> >> usage for anon writeback is less than file writeback.
> >
> > How do we examine swap-on-file?
> 
> bool is_swap_on_file(struct page *page)
> {
>     struct swap_info_struct *p;
>     swp_entry_entry entry;
>     entry.val = page_private(page);
>     p = swap_info_get(entry);
>     return !(p->flags & SWP_BLKDEV)
> }

Well, are you suggesting we traverse all pages in the LRU _before_
starting vmscan?




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  6:02             ` KOSAKI Motohiro
@ 2010-07-06  6:38               ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06  6:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 6, 2010 at 3:02 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Tue, Jul 6, 2010 at 9:36 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Hello,
>> >
>> >> Ok, that's reasonable as I'm still working on that patch. For example, the
>> >> patch disabled anonymous page writeback which is unnecessary as the stack
>> >> usage for anon writeback is less than file writeback.
>> >
>> > How do we examine swap-on-file?
>>
>> bool is_swap_on_file(struct page *page)
>> {
>>     struct swap_info_struct *p;
>>     swp_entry_entry entry;
>>     entry.val = page_private(page);
>>     p = swap_info_get(entry);
>>     return !(p->flags & SWP_BLKDEV)
>> }
>
> Well, do you suggested we traverse all pages in lru _before_
> starting vmscan?
>

No, I am not suggesting anything; I am just saying that we could do it.
If we do have to implement it, couldn't we do it in write_reclaim_page?



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06  0:36         ` KOSAKI Motohiro
@ 2010-07-06 10:12           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-06 10:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> Hello,
> 
> > Ok, that's reasonable as I'm still working on that patch. For example, the
> > patch disabled anonymous page writeback which is unnecessary as the stack
> > usage for anon writeback is less than file writeback. 
> 
> How do we examine swap-on-file?
> 

Anything in particular wrong with the following?

/*
 * For now, only kswapd can writeback filesystem pages as otherwise
 * there is a stack overflow risk
 */
static inline bool reclaim_can_writeback(struct scan_control *sc,
                                        struct page *page)
{
        return !page_is_file_cache(page) || current_is_kswapd();
}

Even if it is a swapfile, I didn't spot a case where the filesystem's
writepage would be called. Did I miss something?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 10:12           ` Mel Gorman
@ 2010-07-06 11:13             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-06 11:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

> On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> > Hello,
> > 
> > > Ok, that's reasonable as I'm still working on that patch. For example, the
> > > patch disabled anonymous page writeback which is unnecessary as the stack
> > > usage for anon writeback is less than file writeback. 
> > 
> > How do we examine swap-on-file?
> > 
> 
> Anything in particular wrong with the following?
> 
> /*
>  * For now, only kswapd can writeback filesystem pages as otherwise
>  * there is a stack overflow risk
>  */
> static inline bool reclaim_can_writeback(struct scan_control *sc,
>                                         struct page *page)
> {
>         return !page_is_file_cache(page) || current_is_kswapd();
> }
> 
> Even if it is a swapfile, I didn't spot a case where the filesystems
> writepage would be called. Did I miss something?

Hmm...

Now I suspect I don't understand what you mean. Do you intend to switch the task
stack on every writepage? That seems a bit costly, but otherwise writepage for anon
pages means filesystem IO and a stack-overflow risk.

Can you please elaborate your plan?




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 10:12           ` Mel Gorman
@ 2010-07-06 11:24             ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06 11:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

Hi, Mel.

On Tue, Jul 6, 2010 at 7:12 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
>> Hello,
>>
>> > Ok, that's reasonable as I'm still working on that patch. For example, the
>> > patch disabled anonymous page writeback which is unnecessary as the stack
>> > usage for anon writeback is less than file writeback.
>>
>> How do we examine swap-on-file?
>>
>
> Anything in particular wrong with the following?
>
> /*
>  * For now, only kswapd can writeback filesystem pages as otherwise
>  * there is a stack overflow risk
>  */
> static inline bool reclaim_can_writeback(struct scan_control *sc,
>                                        struct page *page)
> {
>        return !page_is_file_cache(page) || current_is_kswapd();
> }
>
> Even if it is a swapfile, I didn't spot a case where the filesystems
> writepage would be called. Did I miss something?


As I understand Kosaki's opinion, he is saying that if we do swapout in
pageout, it isn't a problem for a swap device since swapout to a
block device is lightweight, but it is still a problem for a swap file.
That's because swapout to a swapfile goes through the filesystem's writepage,
which risks kernel stack overflow.

Do I misunderstand Kosaki's point?


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 11:24             ` Minchan Kim
@ 2010-07-06 15:25               ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-06 15:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Andrew Morton, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> Hi, Mel.
> 
> On Tue, Jul 6, 2010 at 7:12 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Tue, Jul 06, 2010 at 09:36:41AM +0900, KOSAKI Motohiro wrote:
> >> Hello,
> >>
> >> > Ok, that's reasonable as I'm still working on that patch. For example, the
> >> > patch disabled anonymous page writeback which is unnecessary as the stack
> >> > usage for anon writeback is less than file writeback.
> >>
> >> How do we examine swap-on-file?
> >>
> >
> > Anything in particular wrong with the following?
> >
> > /*
> >  * For now, only kswapd can writeback filesystem pages as otherwise
> >  * there is a stack overflow risk
> >  */
> > static inline bool reclaim_can_writeback(struct scan_control *sc,
> >                                        struct page *page)
> > {
> >        return !page_is_file_cache(page) || current_is_kswapd();
> > }
> >
> > Even if it is a swapfile, I didn't spot a case where the filesystems
> > writepage would be called. Did I miss something?
> 
> 
> As I understand Kosaki's opinion, He said that if we make swapout in
> pageout, it isn't a problem in case of swap device since swapout of
> block device is light

Sure

> but it is still problem in case of swap file.
> That's because swapout on swapfile cause file system writepage which
> makes kernel stack overflow.
> 

I don't *think* this is a problem unless I missed where writing out to
swap enters the filesystem code. I'll double check.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 15:25               ` Mel Gorman
@ 2010-07-06 20:27                 ` Johannes Weiner
  -1 siblings, 0 replies; 105+ messages in thread
From: Johannes Weiner @ 2010-07-06 20:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> > but it is still problem in case of swap file.
> > That's because swapout on swapfile cause file system writepage which
> > makes kernel stack overflow.
> 
> I don't *think* this is a problem unless I missed where writing out to
> swap enters teh filesystem code. I'll double check.

It bypasses the fs.  On swapon, the blocks are resolved
(mm/swapfile.c::setup_swap_extents) and then the writeout path uses
bios directly (mm/page_io.c::swap_writepage).

(GFP_NOFS still includes __GFP_IO, so allows swapping)
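
A minimal sketch of that path, condensed from mm/page_io.c::swap_writepage()
of this era (error handling and the sync-mode bio flags elided):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
        struct bio *bio;

        if (try_to_free_swap(page)) {
                unlock_page(page);
                return 0;
        }

        /*
         * The bio's target sector comes from the swap extents resolved at
         * swapon time, so no filesystem ->writepage is involved here.
         */
        bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
        if (bio == NULL) {
                set_page_dirty(page);
                unlock_page(page);
                return -ENOMEM;
        }

        count_vm_event(PSWPOUT);
        set_page_writeback(page);
        unlock_page(page);
        submit_bio(WRITE, bio);
        return 0;
}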

	Hannes

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-06 22:28                   ` Minchan Kim
  -1 siblings, 0 replies; 105+ messages in thread
From: Minchan Kim @ 2010-07-06 22:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
>> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
>> > but it is still problem in case of swap file.
>> > That's because swapout on swapfile cause file system writepage which
>> > makes kernel stack overflow.
>>
>> I don't *think* this is a problem unless I missed where writing out to
>> swap enters teh filesystem code. I'll double check.
>
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
>
> (GFP_NOFS still includes __GFP_IO, so allows swapping)
>
>        Hannes

Thanks, Hannes. You're right.
The extents are resolved by setup_swap_extents.
Sorry for the confusion, Mel.

It was just my guess at what Kosaki meant; he may have had a different point in mind.
Ignore me.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 22:28                   ` Minchan Kim
@ 2010-07-07  0:24                     ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  0:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, KOSAKI Motohiro, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 07, 2010 at 07:28:14AM +0900, Minchan Kim wrote:
> On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> >> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> >> > but it is still problem in case of swap file.
> >> > That's because swapout on swapfile cause file system writepage which
> >> > makes kernel stack overflow.
> >>
> >> I don't *think* this is a problem unless I missed where writing out to
> >> swap enters teh filesystem code. I'll double check.
> >
> > It bypasses the fs.  On swapon, the blocks are resolved
> > (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> > bios directly (mm/page_io.c::swap_writepage).
> >
> > (GFP_NOFS still includes __GFP_IO, so allows swapping)
> >
> >        Hannes
> 
> Thanks, Hannes. You're right.
> Extents would be resolved by setup_swap_extents.
> Sorry for confusing, Mel.
> 

No confusion. I was 99.99999% certain this was the case and had tested with
a few BUG_ON()s just in case, but confirmation is helpful. Thanks, both.

What I have now is direct writeback for anon pages. For file pages, be it from
kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
on the dirty pages encountered, because monitoring from systemtap indicated
that we were getting a large percentage of the dirty file pages at the end
of the LRU lists (bad). Initial tests show that page reclaim writeback from
kswapd is reduced by 97% with this sort of pre-emptive kicking of the flusher
threads, based on these figures from sysbench.

                traceonly-v4r1  stackreduce-v4r1    flushforward-v4r4
Direct reclaims                                621        710         30928 
Direct reclaim pages scanned                141316     141184       1912093 
Direct reclaim write file async I/O          23904      28714             0 
Direct reclaim write anon async I/O            716        918            88 
Direct reclaim write file sync I/O               0          0             0 
Direct reclaim write anon sync I/O               0          0             0 
Wake kswapd requests                        713250     735588       5626413 
Kswapd wakeups                                1805       1498           641 
Kswapd pages scanned                      17065538   15605327       9524623 
Kswapd reclaim write file async I/O         715768     617225         23938  <-- Wooo
Kswapd reclaim write anon async I/O         218003     214051        198746 
Kswapd reclaim write file sync I/O               0          0             0 
Kswapd reclaim write anon sync I/O               0          0             0 
Time stalled direct reclaim (ms)              9.87      11.63        315.30 
Time kswapd awake (ms)                     1884.91    2088.23       3542.92 

This is "good" IMO because file IO from page reclaim is frowned upon because
of poor IO patterns. There isn't a launder process I can kick for anon pages
to get overall reclaim IO down but it's not clear it's worth it at this
juncture because AFAIK, IO to swap blows anyway. The biggest plus is that
direct reclaim still does not call into the filesystem with my current series, so
stack overflows are less of a heartache. As the number of pages encountered
for filesystem writeback is reduced, it's also less of a problem for memcg.

The direct reclaim stall latency increases because of congestion_wait
throttling, but the overall test completes 602 seconds faster, or by 8% (figures
not included). Scanning rates go up, but with the reduced time to completion,
on balance I think it works out.

Andrew has picked up some of the series, but I have another modification
to the tracepoints to differentiate between anon and file IO, which I now
think is a very important distinction as the flushers work on one but not the
other. I also must rebase upon an mmotm based on 2.6.35-rc4 before re-posting
the series, but broadly speaking, I think we are going in the right direction
without needing stack-switching tricks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-07  1:14                   ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07  1:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Minchan Kim, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Tue, Jul 06, 2010 at 10:27:58PM +0200, Johannes Weiner wrote:
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
> 
> (GFP_NOFS still includes __GFP_IO, so allows swapping)

Exactly.  Note that while the stack problems for swap writeout aren't
as bad as for filesystems, since the whole allocator / extent map footprint
is missing, it might still be an issue.  We still splice the whole block
I/O stack footprint onto a random stack that might already be filled up a lot.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  0:24                     ` Mel Gorman
@ 2010-07-07  1:15                       ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07  1:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On Wed, Jul 07, 2010 at 01:24:58AM +0100, Mel Gorman wrote:
> What I have now is direct writeback for anon files. For files be it from
> kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
> on the dirty pages encountered because monitoring from systemtap indicated
> that we were getting a large percentage of the dirty file pages at the end
> of the LRU lists (bad). Initial tests show that page reclaim writeback is
> reduced from kswapd by 97% with this sort of pre-emptive kicking of flusher
> threads based on these figures from sysbench.

That sounds like yet another band-aid to me.  Instead it would be much
better to not have so many file pages at the end of the LRU by tuning the
flusher threads and the VM better.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-05 13:49       ` Mel Gorman
@ 2010-07-07  5:03         ` Wu Fengguang
  -1 siblings, 0 replies; 105+ messages in thread
From: Wu Fengguang @ 2010-07-07  5:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

Hi Mel,

> Second, using systemtap, I was able to see that file-backed dirty
> pages have a tendency to be near the end of the LRU even though they
> are a small percentage of the overall pages in the LRU. I'm hoping
> to figure out why this is as it would make avoiding writeback a lot
> less controversial.

Your intuitions are correct -- the current background writeback logic
fails to write older inodes first. Under heavy load the background
writeback job may run forever, totally ignoring the time order of
inode->dirtied_when. This is probably why you see lots of dirty pages
near the end of the LRU.

Here is an old patch for fixing this. Sorry for being late. I'll
pick up and refresh the patch series ASAP.  (I made the mistake last
year of posting too many patches at once. I'll break them up into
more manageable pieces.)

[PATCH 31/45] writeback: sync old inodes first in background writeback
<https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476313>
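
The gist of it, as an illustrative fragment only (not the actual patch; the
'expire' cutoff plays the same role as the kupdate dirty_expire_interval):

        /*
         * In background writeback, move inodes dirtied before the cutoff
         * from b_dirty to b_io first, so old inodes are written before
         * newly dirtied ones.
         */
        list_for_each_entry_safe(inode, tmp, &wb->b_dirty, i_list) {
                if (time_before(inode->dirtied_when, expire))
                        list_move_tail(&inode->i_list, &wb->b_io);
        }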

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  1:15                       ` Christoph Hellwig
@ 2010-07-07  9:43                         ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  9:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Minchan Kim, Johannes Weiner, KOSAKI Motohiro, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli

On Tue, Jul 06, 2010 at 09:15:33PM -0400, Christoph Hellwig wrote:
> On Wed, Jul 07, 2010 at 01:24:58AM +0100, Mel Gorman wrote:
> > What I have now is direct writeback for anon files. For files be it from
> > kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
> > on the dirty pages encountered because monitoring from systemtap indicated
> > that we were getting a large percentage of the dirty file pages at the end
> > of the LRU lists (bad). Initial tests show that page reclaim writeback is
> > reduced from kswapd by 97% with this sort of pre-emptive kicking of flusher
> > threads based on these figures from sysbench.
> 
> That sounds like yet another bad aid to me.  Instead it would be much
> better to not have so many file pages at the end of LRU by tuning the
> flusher threads and VM better.
> 

Do you mean "so many dirty file pages"? I'm going to assume you do.

How do you suggest tuning this? The modification I tried was "if N dirty
pages are found during a SWAP_CLUSTER_MAX scan of pages, assume an average
dirtying density of at least that during the time those pages were inserted on
the LRU. In response, ask the flushers to flush 1.5X". This roughly responds
to the conditions it finds as they are encountered and is based on scanning
rates instead of time. It seemed like a reasonable option.
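
In code it amounts to little more than the following (illustrative fragment,
not the exact patch; nr_dirty is the count of dirty pages the scan encountered):

        /*
         * Ask the flusher threads to clean roughly 1.5x the dirty pages
         * just encountered so the next pass over this part of the LRU
         * finds them clean.
         */
        if (nr_dirty)
                wakeup_flusher_threads(nr_dirty + nr_dirty / 2);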

Based on what I've seen, we are generally below the dirty_ratio and the
flushers are behaving as expected, so there is little tuning available there.
As new dirty pages are added to the inactive list, they are allowed to reach
the bottom of the LRU before the periodic sync kicks in. From what I can tell,
the flusher threads are already cleaning the oldest inodes first, and I'd
expect a rough correlation between the oldest inodes and the oldest pages.

We could reduce the dirty_ratio, but people already complain about workloads
that are not allowed to dirty enough pages. We could decrease the sync
interval for the flusher threads, but then IO might start sooner than it
should, and it might be unnecessary if the system is under no memory pressure.

Alternatives?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  5:03         ` Wu Fengguang
@ 2010-07-07  9:50           ` Mel Gorman
  -1 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2010-07-07  9:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

On Wed, Jul 07, 2010 at 01:03:38PM +0800, Wu Fengguang wrote:
> Hi Mel,
> 
> > Second, using systemtap, I was able to see that file-backed dirty
> > pages have a tendency to be near the end of the LRU even though they
> > are a small percentage of the overall pages in the LRU. I'm hoping
> > to figure out why this is as it would make avoiding writeback a lot
> > less controversial.
> 
> Your intuition is correct -- the current background writeback logic
> fails to write older inodes first. Under heavy load, the background
> writeback job may run forever, totally ignoring the time order of
> inode->dirtied_when. This is probably why you see lots of dirty pages
> near the end of the LRU.
> 

Possible. In a mail to Christoph, I asserted that writeback of older inodes
was happening first, but I could obviously be mistaken.

> Here is an old patch that fixes this. Sorry for being late. I'll
> pick up and refresh the patch series ASAP. (I made the mistake last
> year of posting too many patches at one time; I'll break them up into
> more manageable pieces.)
> 
> [PATCH 31/45] writeback: sync old inodes first in background writeback
> <https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476313>
> 

I'll check it out as an alternative to forward-flushing based on the
number of dirty pages encountered during scanning. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  9:43                         ` Mel Gorman
@ 2010-07-07 12:51                           ` Rik van Riel
  -1 siblings, 0 replies; 105+ messages in thread
From: Rik van Riel @ 2010-07-07 12:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Minchan Kim, Johannes Weiner, KOSAKI Motohiro,
	Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

On 07/07/2010 05:43 AM, Mel Gorman wrote:

> How do you suggest tuning this? The modification I tried was "if N dirty
> pages are found during a SWAP_CLUSTER_MAX scan of pages, assume an average
> dirtying density of at least that during the time those pages were inserted on
> the LRU. In response, ask the flushers to flush 1.5X". This roughly responds
> to the conditions it finds as they are encountered and is based on scanning
> rates instead of time. It seemed like a reasonable option.

Your idea sounds like something we need to have, regardless
of whether or not we fix the flusher to flush older inodes
first (we probably should do that, too).

I believe this for the simple reason that we could have too
many dirty pages in one memory zone, while the flusher's
dirty threshold is system wide.

If we both fix the flusher to flush old inodes first and
kick the flusher from the reclaim code, we should be
golden.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-07  5:03         ` Wu Fengguang
@ 2010-07-07 18:09           ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2010-07-07 18:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Jan Kara

On Wed, Jul 07, 2010 at 01:03:38PM +0800, Wu Fengguang wrote:
> Here is an old patch that fixes this. Sorry for being late. I'll
> pick up and refresh the patch series ASAP. (I made the mistake last
> year of posting too many patches at one time; I'll break them up into
> more manageable pieces.)

Yes, that would be very welcome.  There's a lot of important work
in that series.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim
  2010-07-06 20:27                 ` Johannes Weiner
@ 2010-07-08  6:39                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  6:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, Mel Gorman, Minchan Kim, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli

> On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> > On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> > > but it is still a problem in the case of a swap file.
> > > That's because swapout on a swapfile causes a filesystem writepage, which
> > > makes the kernel stack overflow.
> >
> > I don't *think* this is a problem unless I missed where writing out to
> > swap enters the filesystem code. I'll double check.
> 
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).

Yeah, my fault. I misunderstood this.

Thank you.



> 
> (GFP_NOFS still includes __GFP_IO, so allows swapping)
> 
> 	Hannes
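
For reference, the path described above looks roughly like this in
this era's mm/page_io.c (a heavily simplified sketch: try_to_free_swap(),
sync flags, accounting and the bio completion handler are omitted, so
treat the details as approximate):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	struct bio *bio;

	/*
	 * Build a bio against the swap device blocks resolved at
	 * swapon time by setup_swap_extents(); no filesystem
	 * ->writepage is entered.
	 */
	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
	if (!bio) {
		set_page_dirty(page);
		unlock_page(page);
		return -ENOMEM;
	}
	set_page_writeback(page);
	unlock_page(page);
	submit_bio(WRITE, bio);
	return 0;
}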




^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2010-07-08  6:39 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-29 11:34 [PATCH 0/14] Avoid overflowing of stack during page reclaim V3 Mel Gorman
2010-06-29 11:34 ` Mel Gorman
2010-06-29 11:34 ` [PATCH 01/14] vmscan: Fix mapping use after free Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 14:27   ` Minchan Kim
2010-06-29 14:27     ` Minchan Kim
2010-07-01  9:53     ` Mel Gorman
2010-07-01  9:53       ` Mel Gorman
2010-06-29 14:44   ` Johannes Weiner
2010-06-29 14:44     ` Johannes Weiner
2010-06-29 11:34 ` [PATCH 02/14] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 03/14] tracing, vmscan: Add trace events for LRU page isolation Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 04/14] tracing, vmscan: Add trace event when a page is written Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 05/14] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 06/14] vmscan: kill prev_priority completely Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 07/14] vmscan: simplify shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 08/14] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 09/14] vmscan: Setup pagevec as late as possible in shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 10/14] vmscan: Setup pagevec as late as possible in shrink_page_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 11/14] vmscan: Update isolated page counters outside of main path in shrink_inactive_list() Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 11:34 ` [PATCH 12/14] vmscan: Do not writeback pages in direct reclaim Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-07-02 19:51   ` Andrew Morton
2010-07-02 19:51     ` Andrew Morton
2010-07-05 13:49     ` Mel Gorman
2010-07-05 13:49       ` Mel Gorman
2010-07-06  0:36       ` KOSAKI Motohiro
2010-07-06  0:36         ` KOSAKI Motohiro
2010-07-06  5:46         ` Minchan Kim
2010-07-06  5:46           ` Minchan Kim
2010-07-06  6:02           ` KOSAKI Motohiro
2010-07-06  6:02             ` KOSAKI Motohiro
2010-07-06  6:38             ` Minchan Kim
2010-07-06  6:38               ` Minchan Kim
2010-07-06 10:12         ` Mel Gorman
2010-07-06 10:12           ` Mel Gorman
2010-07-06 11:13           ` KOSAKI Motohiro
2010-07-06 11:13             ` KOSAKI Motohiro
2010-07-06 11:24           ` Minchan Kim
2010-07-06 11:24             ` Minchan Kim
2010-07-06 15:25             ` Mel Gorman
2010-07-06 15:25               ` Mel Gorman
2010-07-06 15:25               ` Mel Gorman
2010-07-06 20:27               ` Johannes Weiner
2010-07-06 20:27                 ` Johannes Weiner
2010-07-06 22:28                 ` Minchan Kim
2010-07-06 22:28                   ` Minchan Kim
2010-07-07  0:24                   ` Mel Gorman
2010-07-07  0:24                     ` Mel Gorman
2010-07-07  0:24                     ` Mel Gorman
2010-07-07  1:15                     ` Christoph Hellwig
2010-07-07  1:15                       ` Christoph Hellwig
2010-07-07  9:43                       ` Mel Gorman
2010-07-07  9:43                         ` Mel Gorman
2010-07-07 12:51                         ` Rik van Riel
2010-07-07 12:51                           ` Rik van Riel
2010-07-07  1:14                 ` Christoph Hellwig
2010-07-07  1:14                   ` Christoph Hellwig
2010-07-08  6:39                 ` KOSAKI Motohiro
2010-07-08  6:39                   ` KOSAKI Motohiro
2010-07-07  5:03       ` Wu Fengguang
2010-07-07  5:03         ` Wu Fengguang
2010-07-07  9:50         ` Mel Gorman
2010-07-07  9:50           ` Mel Gorman
2010-07-07 18:09         ` Christoph Hellwig
2010-07-07 18:09           ` Christoph Hellwig
2010-06-29 11:34 ` [PATCH 13/14] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-30 13:05   ` Chris Mason
2010-06-30 13:05     ` Chris Mason
2010-07-01  9:55     ` Mel Gorman
2010-07-01  9:55       ` Mel Gorman
2010-07-01  9:55       ` Mel Gorman
2010-06-29 11:34 ` [PATCH 14/14] fs,xfs: " Mel Gorman
2010-06-29 11:34   ` Mel Gorman
2010-06-29 12:37   ` Christoph Hellwig
2010-06-29 12:37     ` Christoph Hellwig
2010-06-29 12:51     ` Mel Gorman
2010-06-29 12:51       ` Mel Gorman
2010-06-30  0:14       ` KAMEZAWA Hiroyuki
2010-06-30  0:14         ` KAMEZAWA Hiroyuki
2010-07-01 10:30         ` Mel Gorman
2010-07-01 10:30           ` Mel Gorman
2010-07-02  6:26           ` KAMEZAWA Hiroyuki
2010-07-02  6:26             ` KAMEZAWA Hiroyuki
2010-07-02  6:31             ` KAMEZAWA Hiroyuki
2010-07-02  6:31               ` KAMEZAWA Hiroyuki
2010-07-05 14:16             ` Mel Gorman
2010-07-05 14:16               ` Mel Gorman
2010-07-06  0:45               ` KAMEZAWA Hiroyuki
2010-07-06  0:45                 ` KAMEZAWA Hiroyuki
2010-07-02 19:33 ` [PATCH 0/14] Avoid overflowing of stack during page reclaim V3 Andrew Morton
2010-07-02 19:33   ` Andrew Morton
2010-07-05  1:35   ` KAMEZAWA Hiroyuki
2010-07-05  1:35     ` KAMEZAWA Hiroyuki
