* [PATCH 00/13] IO-less dirty throttling v2
@ 2010-11-17  4:27 Wu Fengguang
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
                   ` (13 more replies)
  0 siblings, 14 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Wu Fengguang, linux-mm, linux-fsdevel, LKML

Andrew,

This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
<http://thread.gmane.org/gmane.linux.kernel.mm/52966>.

The basic idea is to introduce a small region under the bdi dirty threshold.
The task is throttled gently when stepping into the bottom of the region,
and throttled more and more aggressively as the bdi dirty+writeback pages
climb towards the top of the region. At some point the application will be
throttled at just the bandwidth that balances with the device write bandwidth.
(The first patch and the documentation have more details.)

Changes from initial RFC:

- adaptive rate limiting, to reduce overheads when under the throttle threshold
- prevent overrunning the dirty limit with lots of concurrent dirtiers
- add Documentation/filesystems/writeback-throttling-design.txt
- lower max pause time from 200ms to 100ms; min pause time from 10ms to 1 jiffy
- don't drop the laptop mode code
- update and comment the trace event
- benchmarks on concurrent dd and fs_mark covering both large and tiny files
- bdi->write_bandwidth updates should be rate limited with concurrent dirtiers,
  otherwise it drifts fast and fluctuates
- don't call balance_dirty_pages_ratelimited() when writing to already dirtied
  pages, otherwise the task will be throttled too much

The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
access I put them in

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2

Wu Fengguang (12):
      writeback: IO-less balance_dirty_pages()
      writeback: consolidate variable names in balance_dirty_pages()
      writeback: per-task rate limit on balance_dirty_pages()
      writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
      writeback: bdi write bandwidth estimation
      writeback: show bdi write bandwidth in debugfs
      writeback: quit throttling when bdi dirty pages dropped
      writeback: reduce per-bdi dirty threshold ramp up time
      writeback: make reasonable gap between the dirty/background thresholds
      writeback: scale down max throttle bandwidth on concurrent dirtiers
      writeback: add trace event for balance_dirty_pages()
      writeback: make nr_to_write a per-file limit

Jan Kara (1):
      writeback: account per-bdi accumulated written pages

 .../filesystems/writeback-throttling-design.txt    |  210 +++++++++++++
 fs/fs-writeback.c                                  |   16 +
 include/linux/backing-dev.h                        |    3 +
 include/linux/sched.h                              |    7 +
 include/linux/writeback.h                          |   14 +
 include/trace/events/writeback.h                   |   61 ++++-
 mm/backing-dev.c                                   |   29 +-
 mm/filemap.c                                       |    5 +-
 mm/memory_hotplug.c                                |    3 -
 mm/page-writeback.c                                |  320 +++++++++++---------
 10 files changed, 511 insertions(+), 157 deletions(-)

It runs smoothly on typical configurations. On small-memory systems the pause
time fluctuates much more due to the limited range available for soft throttling.

The soft dirty threshold is now lowered to (background + dirty)/2 = 15%. So it
will throttle the applications a bit earlier, which may be perceived by end
users as a performance "slow down" if their application happens to dirty a bit
more than 15% of memory. Note that the vanilla kernel also has this limit at
fresh boot: it starts checking bdi limits when exceeding the global 15%, but
the bdi limit ramps up pretty slowly in common configurations, so the task is
throttled immediately.
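
For reference, with the default vm.dirty_background_ratio=10 and
vm.dirty_ratio=20, the 15% figure is simply the midpoint of the two knobs. The
tiny userspace sketch below only does that arithmetic; it is not kernel code:

#include <stdio.h>

int main(void)
{
	int dirty_background_ratio = 10;	/* default vm.dirty_background_ratio */
	int vm_dirty_ratio = 20;		/* default vm.dirty_ratio */

	/* the soft throttle threshold sits halfway between the two limits */
	printf("soft throttle threshold = %d%% of reclaimable memory\n",
	       (dirty_background_ratio + vm_dirty_ratio) / 2);
	return 0;
}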

The task's think time is not considered for now when computing the pause time,
so an "scp" over the network will be throttled much harder than a local "cp".
Once we take the user space think time into account and ensure accurate throttle
bandwidth, we will effectively have a simple write I/O bandwidth controller.
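
A rough sketch of that idea (not implemented in this series): measure the
jiffies the task spent generating data since its last balance_dirty_pages()
call and subtract that "think time" from the computed pause, so a slow
producer like scp gets little or no extra sleep. All names and numbers below
are illustrative assumptions, not code from this patchset:

#include <stdio.h>

#define HZ 250

/* period: the full sleep requested by the throttle bandwidth (jiffies)
 * think:  jiffies the task already spent producing data since last call */
static long pause_after_think(long period, long think)
{
	long pause = period - think;

	return pause > 0 ? pause : 0;	/* never sleep for a negative time */
}

int main(void)
{
	long period = 8;	/* assumed full pause: 8 jiffies = 32ms at HZ=250 */

	/* a local cp barely thinks; a network scp spends most time reading */
	printf("cp  (think=0 jiffies): pause=%ld\n", pause_after_think(period, 0));
	printf("scp (think=6 jiffies): pause=%ld\n", pause_after_think(period, 6));
	return 0;
}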

On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
improves IO throughput from 38MB/s to 42MB/s.

The fs_mark benchmark is interesting. The CPU overheads are almost halved:
before the patch the benchmark is actually CPU bound; after the patch it is
IO bound, yet strangely the throughput becomes slightly lower.

#  ./fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11 
#       Version 3.3, 12 thread(s) starting at Thu Nov 11 21:01:36 2010
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.
#

2.6.36
FSUse%        Count         Size    Files/sec     App Overhead
     0      1200000            1       1261.7        524762513
     0      2400000            1       1195.3        537844546
     0      3600000            1       1231.9        496441566
     1      4800000            1       1175.8        552421522
     1      6000000            1       1191.6        558529735
     1      7200000            1       1165.3        551178395
     2      8400000            1       1175.0        533209632
     2      9600000            1       1200.6        534862246
     2     10800000            1       1181.2        540616486
     2     12000000            1       1137.4        554551797
     3     13200000            1       1143.7        563319651
     3     14400000            1       1169.0        519527533
     3     15600000            1       1184.0        533550370
     4     16800000            1       1161.3        534358727
     4     18000000            1       1193.4        521610050
     4     19200000            1       1177.6        524117437
     5     20400000            1       1172.6        506166634
     5     21600000            1       1172.3        515725633

avg                                    1182.761      533488581.833

2.6.36+
FSUse%        Count         Size    Files/sec     App Overhead
     0      1200000            1       1125.0        357885976
     0      2400000            1       1155.6        288103795
     0      3600000            1       1172.4        296521755
     1      4800000            1       1136.0        301718887
     1      6000000            1       1156.7        303605077
     1      7200000            1       1102.9        288852150
     2      8400000            1       1140.9        294894485
     2      9600000            1       1148.0        314394450
     2     10800000            1       1099.7        296365560
     2     12000000            1       1153.6        316283083
     3     13200000            1       1087.9        339988006
     3     14400000            1       1183.9        270836344
     3     15600000            1       1122.7        276400918
     4     16800000            1       1132.1        285272223
     4     18000000            1       1154.8        283424055
     4     19200000            1       1202.5        294558877
     5     20400000            1       1158.1        293971332
     5     21600000            1       1159.4        287720335
     5     22800000            1       1150.1        282987509
     5     24000000            1       1150.7        283870613
     6     25200000            1       1123.8        288094185
     6     26400000            1       1152.1        296984323
     6     27600000            1       1190.7        282403174
     7     28800000            1       1088.6        290493643
     7     30000000            1       1144.1        290311419
     7     31200000            1       1186.0        290021271
     7     32400000            1       1213.9        279465138
     8     33600000            1       1117.3        275745401

avg                                    1146.768      294684785.143


I noticed that

1) BdiWriteback can grow very large. For example, bdi 8:16 has 72960KB of
   writeback pages, however the disk IO queue can hold at most
   nr_requests*max_sectors_kb = 128*512KB = 64MB of writeback pages. Maybe xfs
   manages to create perfect sequential layouts and writes, and the other 8MB
   of writeback pages are in flight inside the disk?

	root@wfg-ne02 /cc/fs_mark-3.3/ne02-2.6.36+# g BdiWriteback /debug/bdi/8:*/*
	/debug/bdi/8:0/stats:BdiWriteback:            0 kB
	/debug/bdi/8:112/stats:BdiWriteback:        68352 kB
	/debug/bdi/8:128/stats:BdiWriteback:        62336 kB
	/debug/bdi/8:144/stats:BdiWriteback:        61824 kB
	/debug/bdi/8:160/stats:BdiWriteback:        67328 kB
	/debug/bdi/8:16/stats:BdiWriteback:        72960 kB
	/debug/bdi/8:176/stats:BdiWriteback:        57984 kB
	/debug/bdi/8:192/stats:BdiWriteback:        71936 kB
	/debug/bdi/8:32/stats:BdiWriteback:        68352 kB
	/debug/bdi/8:48/stats:BdiWriteback:        56704 kB
	/debug/bdi/8:64/stats:BdiWriteback:        50304 kB
	/debug/bdi/8:80/stats:BdiWriteback:        68864 kB
	/debug/bdi/8:96/stats:BdiWriteback:         2816 kB

2) The 12 disks are not all 100% utilized. Not even close: sdd, sdf, sdh, sdj
   are almost idle at the moment. Dozens of seconds later, some other disks
   become idle. This happens both before and after the patch. There may be
   some hidden bugs (unrelated to this patchset).

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.17    0.00   97.87    1.08    0.00    0.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00    63.00    0.00  125.00     0.00  1909.33    30.55     3.88   31.65   6.57  82.13
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00    19.00    0.00  112.00     0.00  1517.17    27.09     3.95   35.33   8.00  89.60
sdg               0.00    92.67    0.33  126.00     2.67  1773.33    28.12    14.83  120.78   7.73  97.60
sdf               0.00    32.33    0.00   91.67     0.00  1408.17    30.72     4.84   52.97   7.72  70.80
sdh               0.00    17.67    0.00    5.00     0.00   124.00    49.60     0.07   13.33   9.60   4.80
sdi               0.00    44.67    0.00    5.00     0.00   253.33   101.33     0.15   29.33  10.93   5.47
sdl               0.00   168.00    0.00  135.67     0.00  2216.33    32.67     6.41   45.42   5.75  78.00
sdk               0.00   225.00    0.00  123.00     0.00  2355.83    38.31     9.50   73.03   6.94  85.33
sdj               0.00     1.00    0.00    2.33     0.00    26.67    22.86     0.01    2.29   1.71   0.40
sdb               0.00    14.33    0.00  101.67     0.00  1278.00    25.14     2.02   19.95   7.16  72.80
sdm               0.00   150.33    0.00  144.33     0.00  2344.50    32.49     5.43   33.94   5.39  77.73

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.12    0.00   98.63    0.83    0.00    0.42

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00   105.67    0.00  127.33     0.00  1810.17    28.43     4.39   32.43   6.67  84.93
sdd               0.00     5.33    0.00   10.67     0.00   128.00    24.00     0.03    2.50   1.25   1.33
sde               0.00   180.33    0.33  107.67     2.67  2109.33    39.11     8.11   73.93   8.99  97.07
sdg               0.00     7.67    0.00   63.67     0.00  1387.50    43.59     1.45   24.29  11.08  70.53
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00    62.67    0.00   94.67     0.00  1743.50    36.83     3.28   34.68   8.52  80.67
sdl               0.00   162.00    0.00  141.67     0.00  2295.83    32.41     7.09   51.79   6.14  86.93
sdk               0.00    34.33    0.00  143.67     0.00  1910.17    26.59     5.07   38.90   6.26  90.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   195.00    0.00   96.67     0.00  1949.50    40.33     5.54   57.23   8.39  81.07
sdm               0.00   155.00    0.00  143.00     0.00  2357.50    32.97     5.21   39.98   5.71  81.60

Thanks,
Fengguang



* [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
                     ` (2 more replies)
  2010-11-17  4:27 ` [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Wu Fengguang, Christoph Hellwig, Theodore Ts'o, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 28688 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. Meanwhile, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALE
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, we get N IO submitters from at least N different inodes at
  the same time, i.e. N different sets of IO being issued with
  potentially zero locality to each other. This results in much lower
  elevator sort/merge efficiency, and hence we seek the disk all over
  the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on its own couples the
  IO size to the wait time, which makes it hard to pick a suitable IO
  size while keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitters.

- NFS may complete a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it is not likely
  that such bursty IO completions can be optimized away, nor the resulting
  large (and tiny) stall times they cause in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 10ms, which burns CPU power)
- avoid too large pause time (more than 100ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, here is a simple cp on ext4 with mem=4G HZ=250.

Before the patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

After the patch, the pause time remains stable at around 32ms

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8

CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of pages recently dirtied by the task. If 100%
of the pages were recently dirtied by the task, it lowers bdi_dirty_limit
by 1/8. If only 1% of the pages were dirtied by the task, it returns an
almost unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
a light dirtier is allowed to make progress (the latter won't be blocked
because R << B in fig.1).

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T
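
To make the numbers concrete, here is a standalone userspace sketch (not part
of this patchset) of the adjustment described above; the 100000-page
bdi_dirty_limit and the 1%/100% weights are made-up values for illustration:

#include <stdio.h>

/* mimics the current task_dirty_limit(): lower the bdi limit by up to 1/8 */
static unsigned long task_dirty_limit(unsigned long bdi_dirty,
				      long numerator, long denominator)
{
	unsigned long long inv = bdi_dirty >> 3;	/* at most 1/8 of the limit */

	inv *= numerator;				/* scale by the task's */
	inv /= denominator;				/* recent dirty fraction */
	return bdi_dirty - inv;
}

int main(void)
{
	unsigned long T = 100000;	/* assumed bdi_dirty_limit, in pages */

	/* heavy dirtier: dirtied 100% of recent pages -> limit lowered by 1/8 */
	printf("heavy dirtier: limit = %lu pages\n", task_dirty_limit(T, 100, 100));
	/* light dirtier: dirtied 1% of recent pages -> almost unmodified limit */
	printf("light dirtier: limit = %lu pages\n", task_dirty_limit(T, 1, 100));
	return 0;
}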

Since each process has its own dirty limit, we use A/B to denote both the
tasks and their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
while A loses weight.  The task_dirty_limit for A and B will approach
the center of the region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies for every N pages it
dirties. Let's call (N/J) the "throttle bandwidth". It is computed by the
following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
        A  = T - Wa * T/16
        La = A - A/16

Here Wa is the task weight for A. It is 0 for a very light dirtier and 1 for
the one heavy dirtier (that consumes 100% of the bdi write bandwidth). The
task weight is updated independently by task_dirty_inc() at set_page_dirty()
time.

When R < La, the task is not throttled at all.
When R > A, the code detects the negative value and chooses to pause for
100ms (the upper pause boundary), then loops over again.
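
Purely for illustration, the following userspace sketch plugs made-up numbers
(a 100MB/s bdi_bandwidth with 4KB pages, Wa=1, N=128 pages dirtied per pause,
a 100000-page T) into the formula above, showing how the pause grows as R
climbs from La towards A:

#include <stdio.h>

#define HZ 250

int main(void)
{
	double T  = 100000;		/* assumed bdi_dirty_limit, in pages */
	double Wa = 1.0;		/* one heavy dirtier */
	double A  = T - Wa * T / 16;	/* task_dirty_limit */
	double La = A - A / 16;		/* task_throttle_thresh */
	double bdi_bw = 100 << 8;	/* 100MB/s in 4KB pages per second */
	double N  = 128;		/* pages dirtied before each pause */

	for (double R = La; R < A; R += (A - La) / 8) {
		double tbw   = bdi_bw * (A - R) / (A - La);
		double pause = HZ * N / tbw;		/* in jiffies */

		printf("R=%6.0f  throttle_bw=%7.0f pages/s  pause=%5.2f jiffies\n",
		       R, tbw, pause);
	}
	return 0;
}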


PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 100ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 100ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  the bdi dirty/writeback pages drop by a reasonably large amount

- when dirty_thresh is exceeded, the task can be throttled for an
  arbitrarily long time


BEHAVIOR CHANGE
===============

Users will notice that applications get throttled once they cross the
global (background + dirty)/2=15% threshold. For a single "cp", it could
be soft throttled at 8*bdi->write_bandwidth around 15% dirty pages, and
be balanced at speed bdi->write_bandwidth around 17.5% dirty pages.
Before the patch, the behavior is to just throttle it at 17.5% dirty
pages.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as a performance "slow down" if their application
happens to dirty more than ~15% of memory.


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, it's in fact slightly slower.

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction in CPU overhead. The change mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increases by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644

The CPU overhead in 2.6.37-rc1-next-20101115+ is higher than in
2.6.36-rc2-mm1+balance_dirty_pages. This may be because the pause time
stabilizes at lower values after some algorithm adjustments (eg.
reducing the minimal pause time from 10ms to 1 jiffy in the new version),
leading to many more balance_dirty_pages() calls. The different pause
times also explain the different system times for the 1/10/100dd cases
on the same 2.6.37-rc1-next-20101115+.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   85 +---
 3 files changed, 249 insertions(+), 56 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 19:49:42.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 19:50:16.000000000 +0800
@@ -42,20 +42,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -279,7 +265,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -473,26 +459,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -529,6 +514,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = HZ/10;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, HZ/10);
+
+pause:
+		__set_current_state(TASK_INTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -544,35 +546,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -589,7 +562,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -638,7 +611,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-11-15 19:49:42.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+write(2) is normally a buffered write that creates dirty page cache pages
+to hold the data and returns immediately. The dirty pages will eventually
+be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% of reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 100ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties that many pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to storage and task write speeds, so that the
+task always gets a suitable (not too long or too small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A].
+At which point (La) the task will be controlled under speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When a new dd task B starts, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%.  When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16.  throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+drift slightly, but that is not a big deal.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and cross each other.
+
+Sure there are always oscillations of bdi_dirty_pages as long as the dirtier
+task alternately dirties and pauses. But they will be bounded. When there is 1
+heavy dirtier, the error bound will be (pause_time * bdi_bandwidth). When there
+are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which remains the same as in the 1 dirtier case (given the same pause time). In
+fact the more dirtier tasks there are, the smaller the error, since the dirtier
+tasks are not likely to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf




* [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages()
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17  4:27 ` [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 3063 bytes --]

Lots of lengthy tests.. Let's compact the names

	*_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

balance_dirty_pages() only cares about the above dirty sum except
in one place -- on starting background writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-15 19:50:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 19:50:27.000000000 +0800
@@ -461,8 +461,8 @@ unsigned long bdi_dirty_limit(struct bac
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long pages_dirtied)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	long nr_reclaimable;
+	long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -480,7 +480,7 @@ static void balance_dirty_pages(struct a
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -489,8 +489,7 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -507,21 +506,21 @@ static void balance_dirty_pages(struct a
 		 * deltas.
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
+				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+		if (bdi_dirty >= bdi_thresh) {
 			pause = HZ/10;
 			goto pause;
 		}
 
 		bw = 100 << 20; /* use static 100MB/s for the moment */
 
-		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		bw = bw * (bdi_thresh - bdi_dirty);
 		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
@@ -537,9 +536,8 @@ pause:
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+				  (nr_dirty > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;




* [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages()
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17  4:27 ` [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 14:39   ` Wu Fengguang
  2010-11-24 10:23   ` Peter Zijlstra
  2010-11-17  4:27 ` [PATCH 04/13] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-per-task-dirty-count.patch --]
[-- Type: text/plain, Size: 9319 bytes --]

Try to limit the dirty throttle pause time to the range [1 jiffy, 100ms],
by controlling how many pages can be dirtied before inserting a pause.

The dirty count will be directly billed to the task struct. Slow start
and quick back-off are employed, so that the stable range will be biased
towards less than 50ms. Another intention is fine timing control for
slow devices, which may need to do a full 100ms pause for every single page.
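
A standalone demo (not kernel code) of the quick back-off arithmetic used
below: after repeated maximum (100ms) pauses, nr_dirtied_pause shrinks by
roughly a quarter each round, and i -= (i+2)/4 can reach 1 but never 0, so
even the slowest device still gets to dirty 1 page per pause. The starting
value of 256 is arbitrary:

#include <stdio.h>

int main(void)
{
	int nr_dirtied_pause = 256;	/* arbitrary starting allowance */
	int step;

	for (step = 0; step < 20; step++) {
		printf("step %2d: nr_dirtied_pause = %d\n", step, nr_dirtied_pause);
		/* quick back off after a full 100ms pause */
		nr_dirtied_pause -= (nr_dirtied_pause + 2) / 4;
	}
	/* (a 1-jiffy pause would instead grow the allowance by 1/32) */
	return 0;
}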

The switch from a per-cpu to a per-task rate limit makes it easier to exceed
the global dirty limit with a fork bomb, where each new task dirties 1 page,
sleeps 10 minutes and then continues to dirty 1000 more pages. The caveat is
that when a task dirties its first page, it may be granted a high
nr_dirtied_pause because nr_dirty is still low at that time. In this way lots
of tasks get free tickets to dirty more pages than allowed. The solution is
to disable rate limiting (ie. to ignore nr_dirtied_pause) entirely once the
bdi becomes dirty exceeded.

Note that some filesystems will dirty a batch of pages before calling
balance_dirty_pages_ratelimited_nr(). This saves a little CPU overhead
at the cost of possibly overrunning the dirty limits a bit and/or, in the
case of very slow devices, pausing the application for much more than
100ms at a time. This is a tradeoff, and seems a reasonable optimization
as long as the batch size is kept within a dozen pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 ++
 mm/memory_hotplug.c   |    3 
 mm/page-writeback.c   |  126 ++++++++++++++++++----------------------
 3 files changed, 65 insertions(+), 71 deletions(-)

--- linux-next.orig/include/linux/sched.h	2010-11-15 21:43:52.000000000 +0800
+++ linux-next/include/linux/sched.h	2010-11-15 21:43:54.000000000 +0800
@@ -1473,6 +1473,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2010-11-15 21:43:52.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-16 14:13:14.000000000 +0800
@@ -36,12 +36,6 @@
 #include <linux/pagevec.h>
 #include <trace/events/writeback.h>
 
-/*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
- */
-static long ratelimit_pages = 32;
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -452,6 +446,40 @@ unsigned long bdi_dirty_limit(struct bac
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it adaptively to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(struct backing_dev_info *bdi)
+{
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	unsigned long dirty_pages;
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	dirty_pages = global_page_state(NR_FILE_DIRTY) +
+		      global_page_state(NR_WRITEBACK) +
+		      global_page_state(NR_UNSTABLE_NFS);
+
+	if (dirty_pages <= (dirty_thresh + background_thresh) / 2)
+		goto out;
+
+	dirty_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+	dirty_pages  = bdi_stat(bdi, BDI_RECLAIMABLE) +
+		       bdi_stat(bdi, BDI_WRITEBACK);
+
+	if (dirty_pages < dirty_thresh)
+		goto out;
+
+	return 1;
+out:
+	return 1 + int_sqrt(dirty_thresh - dirty_pages);
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -467,7 +495,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long bw;
-	unsigned long pause;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -549,6 +577,17 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	if (pause == 0 && nr_dirty < background_thresh)
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
+	else if (pause == 1)
+		current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;
+	else if (pause >= HZ/10)
+		/*
+		 * when repeated, writing 1 page per 100ms on slow devices,
+		 * i-(i+2)/4 will be able to reach 1 but never reduce to 0.
+		 */
+		current->nr_dirtied_pause -= (current->nr_dirtied_pause+2) >> 2;
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -575,8 +614,6 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -586,36 +623,30 @@ static DEFINE_PER_CPU(unsigned long, bdp
  * which was newly dirtied.  The function will periodically check the system's
  * dirty state and will initiate writeback if needed.
  *
- * On really big machines, get_writeback_state is expensive, so try to avoid
+ * On really big machines, global_page_state() is expensive, so try to avoid
  * calling it too often (ratelimiting).  But once we're over the dirty memory
- * limit we decrease the ratelimiting by a lot, to prevent individual processes
- * from overshooting the limit by (ratelimit_pages) each.
+ * limit we disable the ratelimiting, to prevent individual processes from
+ * overshooting the limit by (ratelimit_pages) each.
  */
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied)
 {
-	unsigned long ratelimit;
-	unsigned long *p;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+
+	current->nr_dirtied += nr_pages_dirtied;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	if (unlikely(!current->nr_dirtied_pause))
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
 	 * tasks in balance_dirty_pages(). Period.
 	 */
-	preempt_disable();
-	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = *p;
-		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
+		     bdi->dirty_exceeded)) {
+		balance_dirty_pages(mapping, current->nr_dirtied);
+		current->nr_dirtied = 0;
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -703,44 +734,6 @@ void laptop_sync_completion(void)
 #endif
 
 /*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
-	if (ratelimit_pages < 16)
-		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
-	writeback_set_ratelimit();
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
-	.notifier_call	= ratelimit_handler,
-	.next		= NULL,
-};
-
-/*
  * Called early on to tune the page writeback dirty limits.
  *
  * We used to scale dirty pages according to how total memory
@@ -762,9 +755,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	writeback_set_ratelimit();
-	register_cpu_notifier(&ratelimit_nb);
-
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
 	prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/mm/memory_hotplug.c	2010-11-15 21:43:52.000000000 +0800
+++ linux-next/mm/memory_hotplug.c	2010-11-15 21:43:54.000000000 +0800
@@ -446,8 +446,6 @@ int online_pages(unsigned long pfn, unsi
 
 	vm_total_pages = nr_free_pagecache_pages();
 
-	writeback_set_ratelimit();
-
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
 
@@ -877,7 +875,6 @@ repeat:
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
-	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
 	unlock_system_sleep();



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 04/13] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17  4:27 ` [PATCH 05/13] writeback: account per-bdi accumulated written pages Wu Fengguang
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-fix-duplicate-bdp-calls.patch --]
[-- Type: text/plain, Size: 1125 bytes --]

When dd'ing in 512-byte chunks, balance_dirty_pages_ratelimited() used to be
called 8 times for the same 4KB page (once per 512-byte write), even though
the page is only dirtied once. Fix it with a (slightly racy) PageDirty() test.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/filemap.c	2010-11-16 22:20:08.000000000 +0800
+++ linux-next/mm/filemap.c	2010-11-16 22:45:05.000000000 +0800
@@ -2258,6 +2258,7 @@ static ssize_t generic_perform_write(str
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
+	unsigned int dirty;
 
 	/*
 	 * Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2306,6 +2307,7 @@ again:
 		pagefault_enable();
 		flush_dcache_page(page);
 
+		dirty = PageDirty(page);
 		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
@@ -2332,7 +2334,8 @@ again:
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		if (!dirty)
+			balance_dirty_pages_ratelimited(mapping);
 
 	} while (iov_iter_count(i));
 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 05/13] writeback: account per-bdi accumulated written pages
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 04/13] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-24 10:26   ` Peter Zijlstra
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2134 bytes --]

From: Jan Kara <jack@suse.cz>

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-11-16 22:48:28.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-11-17 00:18:56.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
--- linux-next.orig/mm/backing-dev.c	2010-11-16 22:48:28.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-11-17 00:18:56.000000000 +0800
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -99,8 +100,9 @@ static int bdi_debug_stats_show(struct s
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
--- linux-next.orig/mm/page-writeback.c	2010-11-17 00:18:54.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-17 00:18:56.000000000 +0800
@@ -1292,6 +1292,7 @@ int test_clear_page_writeback(struct pag
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__inc_bdi_stat(bdi, BDI_WRITTEN);
 				__bdi_writeout_inc(bdi);
 			}
 		}



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (4 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 05/13] writeback: account per-bdi accumulated written pages Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
                     ` (2 more replies)
  2010-11-17  4:27 ` [PATCH 07/13] writeback: show bdi write bandwidth in debugfs Wu Fengguang
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Li Shaohua, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-bandwidth-estimation-in-flusher.patch --]
[-- Type: text/plain, Size: 6263 bytes --]

The estimate starts at 100MB/s and adapts to the real bandwidth within
seconds.  It's pretty accurate for common filesystems.

As the first use case, it replaces the fixed 100MB/s value used for
throttle bandwidth calculation in balance_dirty_pages().

The overheads won't be high because the bdi bandwidth update only occurs
at >10ms intervals.

Initially it's only estimated in balance_dirty_pages() because this is
the most reliable place to get a reasonably large bandwidth -- the bdi is
normally fully utilized when bdi_thresh is reached.

Shaohua then recommended also doing the estimation in the flusher thread,
to keep the value updated when there is only periodic/background writeback
and no task is being throttled.

The estimation cannot be done purely in the flusher thread because that's
not sufficient for NFS: NFS writeback won't block at get_request_wait(), so
it tends to complete quickly. Another problem is that slow devices may take
dozens of seconds to write the initial 64MB chunk (write_bandwidth
starts with 100MB/s, this translates to 64MB nr_to_write). So it may
take more than 1 minute to adapt to the smallish bandwidth if the
bandwidth is only updated in the flusher thread.
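
For reference, here is a minimal user-space sketch of the moving-average
arithmetic described above (my own illustration, not the kernel code; it
assumes 4096-byte pages and HZ=1000, the names are made up, and it omits the
rate limiting against concurrent updaters done in the real patch):

#include <stdio.h>

#define HZ_SK		1000ULL
#define PAGE_SIZE_SK	4096ULL

static unsigned long long write_bandwidth = 100ULL << 20;	/* start at 100MB/s */

/* Fold one sample ('written' pages completed over 'elapsed' jiffies) in. */
static void update_write_bandwidth(unsigned long long written,
				   unsigned long long elapsed)
{
	unsigned long long bw, w;

	if (elapsed < HZ_SK / 100)		/* skip samples shorter than 10ms */
		return;

	/* instantaneous bandwidth of this sample, in bytes per second */
	bw = (HZ_SK * PAGE_SIZE_SK * written + elapsed / 2) / elapsed;

	/* longer samples get more weight, capped at 128/1024 = 1/8 */
	w = elapsed / (HZ_SK / 100);
	if (w > 128)
		w = 128;

	write_bandwidth = (write_bandwidth * (1024 - w) + bw * w) >> 10;
}

int main(void)
{
	int i;

	/* feed ~59MB/s samples (1500 pages per 100ms); the estimate decays
	 * from the initial 100MB/s toward the measured rate */
	for (i = 0; i < 100; i++)
		update_write_bandwidth(1500, 100);
	printf("estimate after 10s of samples: %llu MB/s\n",
	       write_bandwidth >> 20);
	return 0;
}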

CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |    5 ++++
 include/linux/backing-dev.h |    2 +
 include/linux/writeback.h   |    3 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   41 +++++++++++++++++++++++++++++++++-
 5 files changed, 51 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-11-15 21:51:41.000000000 +0800
@@ -75,6 +75,8 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	struct prop_local_percpu completions;
+	unsigned long write_bandwidth_update_time;
+	int write_bandwidth;
 	int dirty_exceeded;
 
 	unsigned int min_ratio;
--- linux-next.orig/mm/backing-dev.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-11-15 21:51:41.000000000 +0800
@@ -660,6 +660,7 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
+	bdi->write_bandwidth = 100 << 20;
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
--- linux-next.orig/fs/fs-writeback.c	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -635,6 +635,8 @@ static long wb_writeback(struct bdi_writ
 		.range_cyclic		= work->range_cyclic,
 	};
 	unsigned long oldest_jif;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 	long wrote = 0;
 	long write_chunk;
 	struct inode *inode;
@@ -668,6 +670,8 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -702,6 +706,7 @@ static long wb_writeback(struct bdi_writ
 		else
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
+		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/mm/page-writeback.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -479,6 +479,41 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written)
+{
+	unsigned long written;
+	unsigned long elapsed;
+	unsigned long bw;
+	unsigned long w;
+
+	if (*bw_written == 0)
+		goto snapshot;
+
+	elapsed = jiffies - *bw_time;
+	if (elapsed < HZ/100)
+		return;
+
+	/*
+	 * When there are lots of tasks throttled in balance_dirty_pages(), they
+	 * will each try to update the bandwidth for the same period, making
+	 * the bandwidth drift much faster than the desired rate (as in the
+	 * single dirtier case). So do some rate limiting.
+	 */
+	if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+		goto snapshot;
+
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
+	bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
+	w = min(elapsed / (HZ/100), 128UL);
+	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+	bdi->write_bandwidth_update_time = jiffies;
+snapshot:
+	*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	*bw_time = jiffies;
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -498,6 +533,8 @@ static void balance_dirty_pages(struct a
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 
 	for (;;) {
 		/*
@@ -546,7 +583,7 @@ static void balance_dirty_pages(struct a
 			goto pause;
 		}
 
-		bw = 100 << 20; /* use static 100MB/s for the moment */
+		bw = bdi->write_bandwidth;
 
 		bw = bw * (bdi_thresh - bdi_dirty);
 		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
@@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
 		pause = clamp_val(pause, 1, HZ/10);
 
 pause:
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/writeback.h	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 21:51:41.000000000 +0800
@@ -137,6 +137,9 @@ int dirty_writeback_centisecs_handler(st
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 07/13] writeback: show bdi write bandwidth in debugfs
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (5 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17  4:27 ` [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Theodore Tso, Peter Zijlstra, Wu Fengguang,
	Christoph Hellwig, Dave Chinner, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2034 bytes --]

Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.

Btw, increase the numeric field width to 10, to keep the possibly huge
BdiWritten number aligned, at least for desktop systems.

This will break user space tools if they are dumb enough to depend on
the number of white spaces.
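
For example, the new field can be watched with the same kind of loop used for
BdiDirtyThresh later in this series (assuming debugfs is mounted at /debug and
the bdi of interest is 8:0):

	$ while true; do grep BdiWriteBandwidth /debug/bdi/8:0/stats; sleep 1; done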

CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/backing-dev.c	2010-11-15 12:52:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-11-15 12:52:44.000000000 +0800
@@ -87,21 +87,23 @@ static int bdi_debug_stats_show(struct s
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
-		   "BdiWriteback:     %8lu kB\n"
-		   "BdiReclaimable:   %8lu kB\n"
-		   "BdiDirtyThresh:   %8lu kB\n"
-		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n"
-		   "BdiWritten:       %8lu kB\n"
-		   "b_dirty:          %8lu\n"
-		   "b_io:             %8lu\n"
-		   "b_more_io:        %8lu\n"
-		   "bdi_list:         %8u\n"
-		   "state:            %8lx\n",
+		   "BdiWriteback:       %10lu kB\n"
+		   "BdiReclaimable:     %10lu kB\n"
+		   "BdiDirtyThresh:     %10lu kB\n"
+		   "DirtyThresh:        %10lu kB\n"
+		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiWritten:         %10lu kB\n"
+		   "BdiWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:            %10lu\n"
+		   "b_io:               %10lu\n"
+		   "b_more_io:          %10lu\n"
+		   "bdi_list:           %10u\n"
+		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
 		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) bdi->write_bandwidth >> 10,
 		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (6 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 07/13] writeback: show bdi write bandwidth in debugfs Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-24 11:13   ` Peter Zijlstra
  2010-11-17  4:27 ` [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bdi-throttle-break.patch --]
[-- Type: text/plain, Size: 2582 bytes --]

Tests show that bdi_thresh may take minutes to ramp up on a typical
desktop. The time should be improvable but cannot be eliminated totally.
So when (background_thresh + dirty_thresh)/2 is reached and
balance_dirty_pages() starts to throttle the task, it will suddenly find
the (still low and ramping up) bdi_thresh is exceeded _excessively_. Here
we definitely don't want to stall the task for one minute (when it's
writing to a USB stick). So introduce an alternative way to break out of
the loop when the bdi dirty+writeback pages have dropped by a reasonable
amount.

When dirty_background_ratio is set close to dirty_ratio, bdi_thresh may
also be constantly exceeded due to the task_dirty_limit() gap. This is
addressed by another patch to lower the background threshold when
necessary.

It will take at least 100ms before trying to break out.

Note that this opens the chance that during normal operation, a huge
number of slow dirtiers writing to a really slow device might manage to
outrun bdi_thresh. But the risk is pretty low. It takes at least one
100ms sleep loop to break out, and the global limit is still enforced.
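
For scale (assuming 4KB pages, my arithmetic): write_bandwidth >>
(PAGE_CACHE_SHIFT + 2) in the patch below is a quarter second worth of
completed pages, i.e. roughly 3840 pages (~15MB) for a 60MB/s disk but only
~64 pages (~256kB) for a 1MB/s USB stick, so a slow device is allowed to
break out after proportionally little progress.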

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-11-15 12:52:34.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 13:08:16.000000000 +0800
@@ -526,6 +526,7 @@ static void balance_dirty_pages(struct a
 {
 	long nr_reclaimable;
 	long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
+	long bdi_prev_dirty = 0;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -578,6 +579,25 @@ static void balance_dirty_pages(struct a
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		/*
+		 * bdi_thresh takes time to ramp up from the initial 0,
+		 * especially for slow devices.
+		 *
+		 * It's possible that at the moment dirty throttling starts,
+		 * 	bdi_dirty = nr_dirty
+		 * 		  = (background_thresh + dirty_thresh) / 2
+		 * 		  >> bdi_thresh
+		 * Then the task could be blocked for a dozen seconds to flush
+		 * all the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
+		 * complementary way to break out of the loop when 250ms worth
+		 * of dirty pages have been cleaned during our pause time.
+		 */
+		if (nr_dirty < dirty_thresh &&
+		    bdi_prev_dirty - bdi_dirty >
+		    bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2))
+			break;
+		bdi_prev_dirty = bdi_dirty;
+
 		if (bdi_dirty >= bdi_thresh) {
 			pause = HZ/10;
 			goto pause;



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (7 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-24 11:15   ` Peter Zijlstra
  2010-11-17  4:27 ` [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Richard Kennedy, Wu Fengguang,
	Christoph Hellwig, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1521 bytes --]

Reduce the dampening for the control system, yielding faster
convergence.

Currently it converges at a snail's pace for slow devices (in order of
minutes).  For really fast storage, the convergence speed should be fine.

It makes sense to make it reasonably fast for typical desktops.

After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
16GB mem, which seems reasonable.

$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh:            0 kB
BdiDirtyThresh:       118748 kB
BdiDirtyThresh:       214280 kB
BdiDirtyThresh:       303868 kB
BdiDirtyThresh:       376528 kB
BdiDirtyThresh:       411180 kB
BdiDirtyThresh:       448636 kB
BdiDirtyThresh:       472260 kB
BdiDirtyThresh:       490924 kB
BdiDirtyThresh:       499596 kB
BdiDirtyThresh:       507068 kB
...
DirtyThresh:          530392 kB
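
A back-of-the-envelope check (my arithmetic, assuming ~4GB of dirtyable
memory and the default vm_dirty_ratio of 20, i.e. dirty_total of roughly
200000 pages): ilog2(dirty_total - 1) = 17, so the period shift drops from
2 + 17 = 19 to 17 - 1 = 16. The averaging period of the per-bdi completion
proportions thus shrinks by a factor of 2^3 = 8, consistent with the
minutes-to-~10s improvement quoted above.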

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-15 13:08:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 13:08:28.000000000 +0800
@@ -125,7 +125,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1) - 1;
 }
 
 /*



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (8 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-24 11:18   ` Peter Zijlstra
  2010-11-17  4:27 ` [PATCH 11/13] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-fix-oversize-background-thresh.patch --]
[-- Type: text/plain, Size: 1304 bytes --]

The change is virtually a no-op for the majority of users, who use the
default 10/20 background/dirty ratios. As for the others, it's not clear why
anyone would set the background ratio so close to the dirty ratio; some even
set the two equal, yet no one seems to notice or complain that the background
threshold is then silently halved under the hood.
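
A concrete example (assuming BDI_SOFT_DIRTY_LIMIT is 8, so the enforced gap
is dirty/4): with dirty_ratio = dirty_background_ratio = 20, the old code
silently halves background to 10% of dirtyable memory; with this patch,
background is capped at 20% - 20%/4 = 15%, so dirty throttling starts around
(15% + 20%)/2 = 17.5% of dirtyable memory instead of 15%.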

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-15 13:12:50.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 13:13:42.000000000 +0800
@@ -403,8 +403,15 @@ void global_dirty_limits(unsigned long *
 	else
 		background = (dirty_background_ratio * available_memory) / 100;
 
-	if (background >= dirty)
-		background = dirty / 2;
+	/*
+	 * Ensure at least 1/4 gap between background and dirty thresholds, so
+	 * that when dirty throttling starts at (background + dirty)/2, it's at
+	 * the entrance of bdi soft throttle threshold, so as to avoid being
+	 * hard throttled.
+	 */
+	if (background > dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT)
+		background = dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT;
+
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 11/13] writeback: scale down max throttle bandwidth on concurrent dirtiers
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (9 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17  4:27 ` [PATCH 12/13] writeback: add trace event for balance_dirty_pages() Wu Fengguang
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-adaptive-throttle-bandwidth.patch --]
[-- Type: text/plain, Size: 3746 bytes --]

This will noticeably reduce the fluctuations of pause time when there are
100+ concurrent dirtiers.

The more parallel dirtiers (1 dirtier => 4 dirtiers), the smaller
bandwidth each dirtier will share (bdi_bandwidth => bdi_bandwidth/4),
the less gap to the dirty limit ((C-A) => (C-B)), the less stable the
pause time will be (given the same fluctuation of bdi_dirty).

For example, if A drifts to A', its pause time may drift from 5ms to
6ms, while B drifting to B' may move from 50ms to 90ms.  That is a much
larger fluctuation, in relative ratio as well as in absolute time.

Fig.1 before patch, gap (C-B) is too low to get smooth pause time

throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
                                              |       o
                                              |         o
throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
                                              |           | o <= B'
----------------------------------------------+-----------+---o
                                              A           B   C

The solution is to lower the slope of the throttle line accordingly,
which makes B stabilize at some point more far away from C.

Fig.2 after patch

throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
    lowered max throttle bandwidth for B ===> *       o
                                              |   *     o
throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
                                              |       |   * o
----------------------------------------------+-------+-------o
                                              A       B       C

Note that C is actually a different point in the 1-dirtier and 4-dirtier
cases, but for easy graphing we move them together.
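
Put together (assuming BDI_SOFT_DIRTY_LIMIT = 8 and TASK_SOFT_DIRTY_LIMIT = 16,
as the other patch descriptions in this series suggest), the throttle
bandwidth computed below becomes roughly

	throttle_bw ~= bdi_write_bandwidth
	               * (bdi_thresh  - bdi_dirty) / (bdi_thresh / 8)
	               * (task_thresh - bdi_dirty) / (bdi_thresh / 16)

The new first factor scales the maximum throttle bandwidth down as bdi_dirty
approaches bdi_thresh (i.e. as more concurrent dirtiers push the bdi dirty
pages up), while the second factor is the original per-task slope, now taken
against task_thresh.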

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-15 19:52:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 21:30:45.000000000 +0800
@@ -537,6 +537,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long task_thresh;
 	unsigned long bw;
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
@@ -566,7 +567,7 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+		task_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -605,14 +606,23 @@ static void balance_dirty_pages(struct a
 			break;
 		bdi_prev_dirty = bdi_dirty;
 
-		if (bdi_dirty >= bdi_thresh) {
+		if (bdi_dirty >= task_thresh) {
 			pause = HZ/10;
 			goto pause;
 		}
 
+		/*
+		 * When bdi_dirty grows closer to bdi_thresh, it indicates more
+		 * concurrent dirtiers. Proportionally lower the max throttle
+		 * bandwidth. This will resist bdi_dirty from approaching to
+		 * bandwidth. This keeps bdi_dirty from getting too close to
+		 * task_thresh, and helps reduce fluctuations of the pause
+		 */
 		bw = bdi->write_bandwidth;
-
 		bw = bw * (bdi_thresh - bdi_dirty);
+		bw = bw / (bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
+
+		bw = bw * (task_thresh - bdi_dirty);
 		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 12/13] writeback: add trace event for balance_dirty_pages()
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (10 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 11/13] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17  4:41   ` Wu Fengguang
  2010-11-17  4:27 ` [PATCH 13/13] writeback: make nr_to_write a per-file limit Wu Fengguang
  2010-11-17 23:03 ` [PATCH 00/13] IO-less dirty throttling v2 Andrew Morton
  13 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 5877 bytes --]

Here is an interesting test to verify the theory with balance_dirty_pages()
tracing. On a partition that can do ~60MB/s, a sparse file is created and
4 rsync tasks with different write bandwidths are started:

	dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
	echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable

	rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
	rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&

Trace outputs within 0.1 second, grouped by tasks:

rsync-3824  [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130876 gap=5340 dirtied=192 pause=20

rsync-3822  [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130777 gap=5113 dirtied=192 pause=20

rsync-3821  [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129570 gap=3714 dirtied=64 pause=8
rsync-3821  [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129589 gap=3733 dirtied=64 pause=8
rsync-3821  [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129604 gap=3748 dirtied=64 pause=8
rsync-3821  [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129583 gap=3983 dirtied=64 pause=8
rsync-3821  [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129594 gap=3802 dirtied=64 pause=8
rsync-3821  [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129619 gap=3827 dirtied=64 pause=8

rsync-3823  [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128762 gap=2842 dirtied=64 pause=12
rsync-3823  [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128813 gap=3021 dirtied=64 pause=12
rsync-3823  [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128805 gap=2885 dirtied=64 pause=12
rsync-3823  [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128823 gap=2967 dirtied=64 pause=12
rsync-3823  [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128821 gap=3221 dirtied=64 pause=12
rsync-3823  [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128825 gap=3225 dirtied=64 pause=12
rsync-3823  [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128782 gap=3214 dirtied=64 pause=12
rsync-3823  [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128764 gap=3036 dirtied=64 pause=12

The data vividly show that

- the heaviest writer is throttled a bit (weight=39%)

- the lighter writers run at full speed (weight=15%,15%,30%)
  rsync is smart enough to compensate for the in-kernel pause time
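
A rough cross-check of these numbers (my arithmetic, not from the trace): the
--bwlimit values 10000/10000/20000/30000 correspond to shares of about 14%,
14%, 29% and 43% of the total requested bandwidth, which lines up well with
the observed task weights of 15%, 15%, 30% and 39%; the heaviest task falls
short of its 43% share exactly because it is the one being throttled.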

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   61 +++++++++++++++++++++++++++--
 mm/page-writeback.c              |    6 ++
 2 files changed, 64 insertions(+), 3 deletions(-)

--- linux-next.orig/include/trace/events/writeback.h	2010-11-15 18:59:00.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-11-15 21:31:46.000000000 +0800
@@ -147,11 +147,66 @@ DEFINE_EVENT(wbc_class, name, \
 DEFINE_WBC_EVENT(wbc_writeback_start);
 DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+#define BDP_PERCENT(a, b, c)	((__entry->a - __entry->b) * 100 * c + \
+				  __entry->bdi_limit/2) / (__entry->bdi_limit|1)
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 long bdi_dirty,
+		 long bdi_limit,
+		 long task_limit,
+		 long pages_dirtied,
+		 long pause),
+
+	TP_ARGS(bdi, bdi_dirty, bdi_limit, task_limit,
+		pages_dirtied, pause),
+
+	TP_STRUCT__entry(
+		__array(char,	bdi, 32)
+		__field(long,	bdi_dirty)
+		__field(long,	bdi_limit)
+		__field(long,	task_limit)
+		__field(long,	pages_dirtied)
+		__field(long,	pause)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+		__entry->bdi_dirty	= bdi_dirty;
+		__entry->bdi_limit	= bdi_limit;
+		__entry->task_limit	= task_limit;
+		__entry->pages_dirtied	= pages_dirtied;
+		__entry->pause		= pause * 1000 / HZ;
+	),
+
+
+	/*
+	 *           [..................soft throttling range.........]
+	 *           ^                |<=========== bdi_gap =========>|
+	 * background_thresh          |<== task_gap ==>|
+	 * -------------------|-------+----------------|--------------|
+	 *   (bdi_limit * 7/8)^       ^bdi_dirty       ^task_limit    ^bdi_limit
+	 *
+	 * Reasonable large gaps help produce smooth pause times.
+	 */
+	TP_printk("bdi=%s bdi_dirty=%lu bdi_limit=%lu task_limit=%lu "
+		  "task_weight=%ld%% task_gap=%ld%% bdi_gap=%ld%% "
+		  "pages_dirtied=%lu pause=%lu",
+		  __entry->bdi,
+		  __entry->bdi_dirty,
+		  __entry->bdi_limit,
+		  __entry->task_limit,
+		  /* task weight: proportion of recent dirtied pages */
+		  BDP_PERCENT(bdi_limit, task_limit, TASK_SOFT_DIRTY_LIMIT),
+		  BDP_PERCENT(task_limit, bdi_dirty, TASK_SOFT_DIRTY_LIMIT),
+		  BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
+		  __entry->pages_dirtied,
+		  __entry->pause
+		  )
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c	2010-11-15 21:30:45.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 21:31:46.000000000 +0800
@@ -629,6 +629,12 @@ static void balance_dirty_pages(struct a
 		pause = clamp_val(pause, 1, HZ/10);
 
 pause:
+		trace_balance_dirty_pages(bdi,
+					  bdi_dirty,
+					  bdi_thresh,
+					  task_thresh,
+					  pages_dirtied,
+					  pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 13/13] writeback: make nr_to_write a per-file limit
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (11 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 12/13] writeback: add trace event for balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 23:03 ` [PATCH 00/13] IO-less dirty throttling v2 Andrew Morton
  13 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 1864 bytes --]

This ensures a full 4MB (or larger) writeback chunk for each large dirty
file: every inode visited in a writeback batch gets the full per-file quota,
instead of just whatever is left of the shared nr_to_write budget, while the
global budget is still decremented by the pages actually written.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   11 +++++++++++
 include/linux/writeback.h |    1 +
 2 files changed, 12 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-11-15 19:52:39.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-15 21:31:50.000000000 +0800
@@ -330,6 +330,8 @@ static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long per_file_limit = wbc->per_file_limit;
+	long nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -365,8 +367,16 @@ writeback_single_inode(struct inode *ino
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
+	if (per_file_limit) {
+		nr_to_write = wbc->nr_to_write;
+		wbc->nr_to_write = per_file_limit;
+	}
+
 	ret = do_writepages(mapping, wbc);
 
+	if (per_file_limit)
+		wbc->nr_to_write += nr_to_write - per_file_limit;
+
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
@@ -698,6 +708,7 @@ static long wb_writeback(struct bdi_writ
 
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
+		wbc.per_file_limit = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
--- linux-next.orig/include/linux/writeback.h	2010-11-15 19:52:39.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 21:31:50.000000000 +0800
@@ -43,6 +43,7 @@ struct writeback_control {
 					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	long per_file_limit;		/* Write this many pages for one file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 12/13] writeback: add trace event for balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 12/13] writeback: add trace event for balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:41   ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

Sorry, the document in this patch lags behind the code in a few places.

On Wed, Nov 17, 2010 at 12:27:32PM +0800, Wu, Fengguang wrote:
> Here is an interesting test to verify the theory with balance_dirty_pages()
> tracing. On a partition that can do ~60MB/s, a sparse file is created and
> 4 rsync tasks with different write bandwidth started:
> 
> 	dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
> 	echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
> 
> 	rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
> 	rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
> 	rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
> 	rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&
> 
> Trace outputs within 0.1 second, grouped by tasks:
> 
> rsync-3824  [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130876 gap=5340 dirtied=192 pause=20
> 
> rsync-3822  [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130777 gap=5113 dirtied=192 pause=20
> 
> rsync-3821  [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129570 gap=3714 dirtied=64 pause=8
> rsync-3821  [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129589 gap=3733 dirtied=64 pause=8
> rsync-3821  [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129604 gap=3748 dirtied=64 pause=8
> rsync-3821  [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129583 gap=3983 dirtied=64 pause=8
> rsync-3821  [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129594 gap=3802 dirtied=64 pause=8
> rsync-3821  [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129619 gap=3827 dirtied=64 pause=8
> 
> rsync-3823  [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128762 gap=2842 dirtied=64 pause=12
> rsync-3823  [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128813 gap=3021 dirtied=64 pause=12
> rsync-3823  [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128805 gap=2885 dirtied=64 pause=12
> rsync-3823  [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128823 gap=2967 dirtied=64 pause=12
> rsync-3823  [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128821 gap=3221 dirtied=64 pause=12
> rsync-3823  [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128825 gap=3225 dirtied=64 pause=12
> rsync-3823  [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128782 gap=3214 dirtied=64 pause=12
> rsync-3823  [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128764 gap=3036 dirtied=64 pause=12

The above lines are in the old output format, but you get the idea..

> +	/*
> +	 *           [..................soft throttling range.........]
> +	 *           ^                |<=========== bdi_gap =========>|
> +	 * background_thresh          |<== task_gap ==>|

That background_thresh should be global (background+dirty)/2.

> +	 * -------------------|-------+----------------|--------------|
> +	 *   (bdi_limit * 7/8)^       ^bdi_dirty       ^task_limit    ^bdi_limit
> +	 *
> +	 * Reasonable large gaps help produce smooth pause times.
> +	 */
> +	TP_printk("bdi=%s bdi_dirty=%lu bdi_limit=%lu task_limit=%lu "
> +		  "task_weight=%ld%% task_gap=%ld%% bdi_gap=%ld%% "
> +		  "pages_dirtied=%lu pause=%lu",
> +		  __entry->bdi,
> +		  __entry->bdi_dirty,
> +		  __entry->bdi_limit,
> +		  __entry->task_limit,
> +		  /* task weight: proportion of recent dirtied pages */
> +		  BDP_PERCENT(bdi_limit, task_limit, TASK_SOFT_DIRTY_LIMIT),
> +		  BDP_PERCENT(task_limit, bdi_dirty, TASK_SOFT_DIRTY_LIMIT),
> +		  BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
> +		  __entry->pages_dirtied,
> +		  __entry->pause
> +		  )
> +);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2010-11-17 10:34   ` Minchan Kim
  2010-11-22  2:01     ` Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
  2010-11-18 13:04   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Minchan Kim @ 2010-11-17 10:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner,
	Peter Zijlstra, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

Hi Wu,

As you know, I am not an expert in this area,
so I hope my review can help other newbies like me understand, and help
make this document clearer. :)
I haven't looked into the code yet; before that, I would like to clarify
the concept with you.

On Wed, Nov 17, 2010 at 1:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the mean while, kick off the
> per-bdi flusher thread to do background writeback IO.
>
> This patch introduces the basic framework, which will be further
> consolidated by the next patches.
>
> RATIONALS
> =========
>
> The current balance_dirty_pages() is rather IO inefficient.
>
> - concurrent writeback of multiple inodes (Dave Chinner)
>
>  If every thread doing writes and being throttled start foreground
>  writeback, it leads to N IO submitters from at least N different
>  inodes at the same time, end up with N different sets of IO being
>  issued with potentially zero locality to each other, resulting in
>  much lower elevator sort/merge efficiency and hence we seek the disk
>  all over the place to service the different sets of IO.
>  OTOH, if there is only one submission thread, it doesn't jump between
>  inodes in the same way when congestion clears - it keeps writing to
>  the same inode, resulting in large related chunks of sequential IOs
>  being issued to the disk. This is more efficient than the above
>  foreground writeback because the elevator works better and the disk
>  seeks less.
>
> - IO size too small for fast arrays and too large for slow USB sticks
>
>  The write_chunk used by current balance_dirty_pages() cannot be
>  directly set to some large value (eg. 128MB) for better IO efficiency.
>  Because it could lead to more than 1 second user perceivable stalls.
>  Even the current 4MB write size may be too large for slow USB sticks.
>  The fact that balance_dirty_pages() starts IO on itself couples the
>  IO size to wait time, which makes it hard to do suitable IO size while
>  keeping the wait time under control.
>
> For the above two reasons, it's much better to shift IO to the flusher
> threads and let balance_dirty_pages() just wait for enough time or progress.
>
> Jan Kara, Dave Chinner and me explored the scheme to let
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However it's found to have two problems:
>
> - in large NUMA systems, the per-cpu counters may have big accounting
>  errors, leading to big throttle wait time and jitters.
>
> - NFS may kill large amount of unstable pages with one single COMMIT.
>  Because NFS server serves COMMIT with expensive fsync() IOs, it is
>  desirable to delay and reduce the number of COMMITs. So it's not
>  likely to optimize away such kind of bursty IO completions, and the
>  resulted large (and tiny) stall times in IO completion based throttling.
>
> So here is a pause time oriented approach, which tries to control the
> pause time in each balance_dirty_pages() invocations, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
>
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than  10ms, which burns CPU power)
> - avoid too large pause time (more than 100ms, which hurts responsiveness)
> - avoid big fluctuations of pause times
>
> For example, when doing a simple cp on ext4 with mem=4G HZ=250.
>
> before patch, the pause time fluctuates from 0 to 324ms
> (and the stall time may grow very large for slow devices)
>
> [ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> [ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
> [ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
> [ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> [ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
>
> after patch, the pause time remains stable around 32ms
>
> cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
>
> CONTROL SYSTEM
> ==============
>
> The current task_dirty_limit() adjusts bdi_dirty_limit to get
> task_dirty_limit according to the dirty "weight" of the current task,
> which is the percent of pages recently dirtied by the task. If 100%
> pages are recently dirtied by the task, it will lower bdi_dirty_limit by
> 1/8. If only 1% pages are dirtied by the task, it will return almost
> unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
> blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
> allowing a light dirtier to progress (the latter won't be blocked
> because R << B in fig.1).
>
> Fig.1 before patch, a heavy dirtier and a light dirtier
>                                                R
> ----------------------------------------------+-o---------------------------*--|
>                                              L A                           B  T
>  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
>  L: T - T/8
>
>  R: bdi_reclaimable + bdi_writeback
>
>  A: task_dirty_limit for a heavy dirtier ~= R ~= L
>  B: task_dirty_limit for a light dirtier ~= T
>
> Since each process has its own dirty limit, we reuse A/B for the tasks as
> well as their dirty limits.
>
> If B is a newly started heavy dirtier, then it will slowly gain weight
> and A will lose weight.  The task_dirty_limit for A and B will be
> approaching the center of region (L, T) and eventually stabilize there.
>
> Fig.2 before patch, two heavy dirtiers converging to the same threshold
>                                                             R
> ----------------------------------------------+--------------o-*---------------|
>                                              L              A B               T

Seems good so far.
So, what's the problem if two heavy dirtiers have the same threshold?

>
> Fig.3 after patch, one heavy dirtier
>                                                |
>    throttle_bandwidth ~= bdi_bandwidth  =>     o
>                                                | o
>                                                |   o
>                                                |     o
>                                                |       o
>                                                |         o
>                                              La|           o
> ----------------------------------------------+-+-------------o----------------|
>                                                R             A                T
>  T: bdi_dirty_limit
>  A: task_dirty_limit      = T - Wa * T/16
>  La: task_throttle_thresh = A - A/16
>
>  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
>
> Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
> this region, the task may be throttled for J jiffies on every N pages it dirtied.
> Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:
>
>        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
> where
>        A = T - Wa * T/16
>        La = A - A/16
> where Wa is task weight for A. It's 0 for very light dirtier and 1 for
> the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
> task weight will be updated independently by task_dirty_inc() at
> set_page_dirty() time.


Dumb question.

I can't see the difference between old and new:
La depends on A.
A depends on Wa.
T is constant?
Then throttle_bandwidth depends on Wa.
Wa depends on the number of pages dirtied during some interval.
So if a light dirtier becomes heavy, eventually the light dirtier and the
heavy dirtier will have the same weight.
That means their throttle_bandwidth is the same, which is the same as the
old result.

Please, open my eyes. :)
Thanks for the great work.

>
> When R < La, we don't throttle it at all.
> When R > A, the code will detect the negativeness and choose to pause
> 100ms (the upper pause boundary), then loop over again.




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
@ 2010-11-17 14:39   ` Wu Fengguang
  2010-11-24 10:23   ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-17 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

> +	if (pause == 0 && nr_dirty < background_thresh)
> +		current->nr_dirtied_pause = ratelimit_pages(bdi);
> +	else if (pause == 1)
> +		current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;

Sorry here is a bug fix for the above line, it's also pushed to the
git tree.

Thanks,
Fengguang
---
Subject: writeback: fix increment of nr_dirtied_pause
Date: Wed Nov 17 22:31:26 CST 2010

Fix a bug where

	current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;

does not increase nr_dirtied_pause at all when it's below 32 (the right
shift then yields 0, e.g. 16 >> 5 == 0), so nr_dirtied_pause may never grow.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-17 22:31:09.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-17 22:31:23.000000000 +0800
@@ -662,7 +662,7 @@ pause:
 	if (pause == 0 && nr_dirty < background_thresh)
 		current->nr_dirtied_pause = ratelimit_pages(bdi);
 	else if (pause == 1)
-		current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;
+		current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
 	else if (pause >= HZ/10)
 		/*
 		 * when repeated, writing 1 page per 100ms on slow devices,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
                   ` (12 preceding siblings ...)
  2010-11-17  4:27 ` [PATCH 13/13] writeback: make nr_to_write a per-file limit Wu Fengguang
@ 2010-11-17 23:03 ` Andrew Morton
  2010-11-18  2:06   ` Dave Chinner
  13 siblings, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2010-11-17 23:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 17 Nov 2010 12:27:20 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> improves IO throughput from 38MB/s to 42MB/s.

The changes in CPU consumption are remarkable.  I've looked through the
changelogs but cannot find mention of where all that time was being
spent?


How well have these changes been tested with NFS?


The changes are complex and will probably do Bad Things for some
people.  Does the code implement sufficient
debug/reporting/instrumentation to enable you to diagnose, understand
and fix people's problems in the minimum possible time?  If not, please
add that stuff.  Just go nuts with it.  Put it in debugfs, add /*
DELETEME */ comments and we can pull it all out again in half a year or
so.

Or perhaps litter the code with temporary tracepoints, provided we can
come up with a way for our testers to trivially gather their output.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
@ 2010-11-17 23:08   ` Andrew Morton
  2010-11-18 13:04   ` Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2010-11-17 23:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 17 Nov 2010 12:27:21 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as a performance "slowdown" if their application
> happens to dirty more than ~15% of memory.

writeback has always had these semi-bogus assumptions that all pages
are the same, and it can sometimes go very wrong.

A chronic case would be a 4GB i386 machine where only 1/4 of memory is
useable for GFP_KERNEL allocations, filesystem metadata and /dev/sdX
pagecache.

When you think about it, a lot of the throttling work being done in
writeback is really being done on behalf of the page allocator (and
hence page reclaim).  But what happens if the workload is mainly
hammering away at ZONE_NORMAL, but writeback is considering ZONE_NORMAL
to be the same thing as ZONE_HIGHMEM?

Or vice versa, where page-dirtyings are all happening in lowmem?  Can
writeback then think that there are plenty of clean pages (because it's
looking at highmem as well) so little or no throttling is happening? 
If so, what effect does this have upon GFP_KERNEL/GFP_USER allocation?

And bear in mind that the user can tune the dirty levels.  If they're
set to 10% on a machine on which 25% of memory is lowmem then ill
effects might be rare.  But if the user tweaks the thresholds to 30%
then can we get into problems?  Such as a situation where 100% of
lowmem is dirty and throttling isn't cutting in?



So please have a think about that and see if you can think of ways in
which this assumption can cause things to go bad.  I'd suggest
writing some targeted tests which write to /dev/sdX (to generate
lowmem-only dirty pages) and which read from /dev/sdX (to request
allocation of lowmem pages).  Run these tests in conjunction with tests
which exercise the highmem zone as well and check that everything
behaves as expected.

Of course, this all assumes that you have a 4GB i386 box :( It's almost
getting to the stage where we need a fake-zone-highmem option for
x86_64 boxes just so we can test this stuff.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2010-11-17 23:08   ` Andrew Morton
  2010-11-17 23:24     ` Peter Zijlstra
  2010-11-18  6:51     ` Wu Fengguang
  2010-11-24 10:58   ` Peter Zijlstra
  2010-11-24 11:05   ` Peter Zijlstra
  2 siblings, 2 replies; 71+ messages in thread
From: Andrew Morton @ 2010-11-17 23:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Li Shaohua, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 17 Nov 2010 12:27:26 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> +	w = min(elapsed / (HZ/100), 128UL);

I did try setting HZ=10 many years ago, and the kernel blew up.

I do recall hearing of people who set HZ very low, perhaps because
their huge machines were seeing performance problems when the timer tick
went off.  Probably there's no need to do that any more.

But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
don't absolutely need to, and I don't think it is absolutely needed
here.  

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-17 23:24     ` Peter Zijlstra
  2010-11-17 23:38       ` Andrew Morton
  2010-11-18  6:51     ` Wu Fengguang
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-17 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 12:27:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +     w = min(elapsed / (HZ/100), 128UL);
> 
> I did try setting HZ=10 many years ago, and the kernel blew up.
> 
> I do recall hearing of people who set HZ very low, perhaps because
> their huge machines were seeing performance problems when the timer tick
> went off.  Probably there's no need to do that any more.
> 
> But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> don't absolutely need to, and I don't think it is absolutely needed
> here.  

People who do cpu bring-up on very slow FPGAs also lower HZ as far as
possible.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:24     ` Peter Zijlstra
@ 2010-11-17 23:38       ` Andrew Morton
  2010-11-17 23:43         ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2010-11-17 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, 18 Nov 2010 00:24:59 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> > On Wed, 17 Nov 2010 12:27:26 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > +     w = min(elapsed / (HZ/100), 128UL);
> > 
> > I did try setting HZ=10 many years ago, and the kernel blew up.
> > 
> > I do recall hearing of people who set HZ very low, perhaps because
> > their huge machines were seeing performance problems when the timer tick
> > went off.  Probably there's no need to do that any more.
> > 
> > But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> > don't absolutely need to, and I don't think it is absolutely needed
> > here.  
> 
> People who do cpu bring-up on very slow FPGAs also lower HZ as far as
> possible.

	grep -r "[^a-zA-Z0-9_]HZ[ ]*/[ ]*100[^0-9]" .

:-(

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:38       ` Andrew Morton
@ 2010-11-17 23:43         ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-17 23:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 15:38 -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 00:24:59 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 12:27:26 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > +     w = min(elapsed / (HZ/100), 128UL);
> > > 
> > > I did try setting HZ=10 many years ago, and the kernel blew up.
> > > 
> > > I do recall hearing of people who set HZ very low, perhaps because
> > > their huge machines were seeing performance problems when the timer tick
> > > went off.  Probably there's no need to do that any more.
> > > 
> > > But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> > > don't absolutely need to, and I don't think it is absolutely needed
> > > here.  
> > 
> > People who do cpu bring-up on very slow FPGAs also lower HZ as far as
> > possible.
> 
> 	grep -r "[^a-zA-Z0-9_]HZ[ ]*/[ ]*100[^0-9]" .

Maybe they've got a patch-kit somewhere. I'll ask around. It would be
nice to have that sorted.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-17 23:03 ` [PATCH 00/13] IO-less dirty throttling v2 Andrew Morton
@ 2010-11-18  2:06   ` Dave Chinner
  2010-11-18  2:09     ` Andrew Morton
  0 siblings, 1 reply; 71+ messages in thread
From: Dave Chinner @ 2010-11-18  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 17, 2010 at 03:03:30PM -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 12:27:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > improves IO throughput from 38MB/s to 42MB/s.
> 
> The changes in CPU consumption are remarkable.  I've looked through the
> changelogs but cannot find mention of where all that time was being
> spent?

In the writeback path, mostly because every CPU is trying to run
writeback at the same time and causing contention on locks and
shared structures in the writeback path. That no longer happens
because writeback is only happening from one thread instead of from
all CPUs at once.

This is one of the reasons why I want this series to be sorted out
before we start to consider scalability of the writeback lists and
locking - controlling the level of writeback parallelism provides a
major reduction in writeback lock contention and indicates that the
next bottleneck we really need to solve is bdi-flusher thread
parallelism...

> How well have these changes been tested with NFS?

Next on my list (just doing some ext4 sanity testing).

> The changes are complex and will probably do Bad Things for some
> people.  Does the code implement sufficient
> debug/reporting/instrumentation to enable you to diagnose, understand
> and fix people's problems in the minimum possible time?  If not, please
> add that stuff.  Just go nuts with it.  Put it in debugfs, add /*
> DELETEME */ comments and we can pull it all out again in half a year or
> so.
> 
> Or perhaps litter the code with temporary tracepoints, provided we can
> come up with a way for our testers to trivially gather their output.

I think tracepoints are the way to go - I've been asking XFS users
to send me trace dumps for anything non-trivial recently. I've been
able to understand the cause of their problems without having to
reproduce the problem locally, which has been a big help....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  2:06   ` Dave Chinner
@ 2010-11-18  2:09     ` Andrew Morton
  2010-11-18  3:21       ` Dave Chinner
  2010-11-24 11:12       ` Avi Kivity
  0 siblings, 2 replies; 71+ messages in thread
From: Andrew Morton @ 2010-11-18  2:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, 18 Nov 2010 13:06:40 +1100 Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Nov 17, 2010 at 03:03:30PM -0800, Andrew Morton wrote:
> > On Wed, 17 Nov 2010 12:27:20 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > > improves IO throughput from 38MB/s to 42MB/s.
> > 
> > The changes in CPU consumption are remarkable.  I've looked through the
> > changelogs but cannot find mention of where all that time was being
> > spent?
> 
> In the writeback path, mostly because every CPU is trying to run
> writeback at the same time and causing contention on locks and
> shared structures in the writeback path. That no longer happens
> because writeback is only happening from one thread instead of from
> all CPUs at once.

It'd be nice to see this quantified.  Partly because handing things
over to kernel threads incurs extra overhead - scheduling cost and CPU
cache footprint.

But mainly because we're taking the work accounting away from the user
who caused it and crediting it to the kernel thread instead, and that's
an actively *bad* thing to do.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  2:09     ` Andrew Morton
@ 2010-11-18  3:21       ` Dave Chinner
  2010-11-18  3:34         ` Andrew Morton
  2010-11-24 11:12       ` Avi Kivity
  1 sibling, 1 reply; 71+ messages in thread
From: Dave Chinner @ 2010-11-18  3:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 17, 2010 at 06:09:12PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 13:06:40 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Nov 17, 2010 at 03:03:30PM -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 12:27:20 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > > > improves IO throughput from 38MB/s to 42MB/s.
> > > 
> > > The changes in CPU consumption are remarkable.  I've looked through the
> > > changelogs but cannot find mention of where all that time was being
> > > spent?
> > 
> > In the writeback path, mostly because every CPU is trying to run
> > writeback at the same time and causing contention on locks and
> > shared structures in the writeback path. That no longer happens
> > because writeback is only happening from one thread instead of from
> > all CPUs at once.
> 
> It'd be nice to see this quantified.  Partly because handing things
> over to kernel threads incurs extra overhead - scheduling cost and CPU
> cache footprint.

Sure, but in this case, the scheduling cost is much lower than
actually doing writeback of 1500 pages. The CPU cache footprint of
the syscall is also greatly reduced because we don't go down
the writeback path. That shows up in the fact that the "app
overhead" measured by fs_mark goes down significantly with this
patch series (30-50% reduction) - it's doing the same work, but it's
taking much less wall time....

And if you are after lock contention numbers, I have quantified it
though I do not have saved lock_stat numbers at hand.  Running the
current inode_lock breakup patchset and the fs_mark workload (8-way
parallel create of 1 byte files), lock_stat shows the
inode_wb_list_lock as the hottest lock in the system (more trafficked
and much more contended than the dcache_lock), along with the
inode->i_lock being the most trafficked.

Running `perf top -p <pid of bdi-flusher>` showed it spending 30-40%
of its time in __ticket_spin_lock. I saw the same thing with every
fs_mark process also showing 30-40% of its time in
__ticket_spin_lock. Every process also showed a good chunk of time
in the writeback path. Overall, the fsmark processes showed a CPU
consumption of ~620% CPU, with the bdi-flusher at 80% of a CPU and
kswapd at 80% of CPU.

With the patchset, all that spin lock time is gone from the profiles
(down to about 2%) as is the writeback path (except for the
bdi-flusher, which is all writeback path). Overall, we have fsmark
processes showing 250% CPU, the bdi-flusher at 80% of a cpu, and
kswapd at about 20% of a CPU, with over 400% idle time.

IOWs, we've traded off 3-4 CPUs worth of spinlock contention and a
flusher thread running at 80% CPU for a flusher thread that runs at
80% CPU doing the same amount of work. To me, that says the cost of
scheduling is orders of magnitude lower than the cost of the current
code...

> But mainly because we're taking the work accounting away from the user
> who caused it and crediting it to the kernel thread instead, and that's
> an actively *bad* thing to do.

The current foreground writeback is doing work on behalf of the
system (i.e. doing background writeback) and therefore crediting it
to the user process. That seems wrong to me; it's hiding the
overhead of system tasks in user processes.

IMO, time spent doing background writeback should not be credited
to user processes - writeback caching is a function of the OS and
its overhead should be accounted as such. Indeed, nobody has
realised (until now) just how inefficient it really is because of
the fact that the overhead is mostly hidden in user process system
time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  3:21       ` Dave Chinner
@ 2010-11-18  3:34         ` Andrew Morton
  2010-11-18  7:27           ` Dave Chinner
  0 siblings, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2010-11-18  3:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, 18 Nov 2010 14:21:41 +1100 Dave Chinner <david@fromorbit.com> wrote:

> > But mainly because we're taking the work accounting away from the user
> > who caused it and crediting it to the kernel thread instead, and that's
> > an actively *bad* thing to do.
> 
> The current foreground writeback is doing work on behalf of the
> system (i.e. doing background writeback) and therefore crediting it
> to the user process. That seems wrong to me; it's hiding the
> overhead of system tasks in user processes.
> 
> IMO, time spent doing background writeback should not be credited
> to user processes - writeback caching is a function of the OS and
> its overhead should be accounted as such.

bah, that's bunk.  Using this logic, _no_ time spent in the kernel
should be accounted to the user process and we may as well do away with
system-time accounting altogether.

If userspace performs some action which causes the kernel to consume
CPU resources, that consumption should be accounted to that process.

Yes, writeback can be inaccurate because process A will write back
process B's stuff, but that should even out on average, and it's more
accurate than saying "zero".

> Indeed, nobody has
> realised (until now) just how inefficient it really is because of
> the fact that the overhead is mostly hidden in user process system
> time.

"hidden"?  You do "time dd" and look at the output!

_now_ it's hidden.  You do "time dd" and whee, no system time!  You
need to do complex gymnastics with kernel thread accounting to work out
the real cost of your dd.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:08   ` Andrew Morton
  2010-11-17 23:24     ` Peter Zijlstra
@ 2010-11-18  6:51     ` Wu Fengguang
  1 sibling, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-18  6:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Li, Shaohua, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, Nov 18, 2010 at 07:08:37AM +0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 12:27:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +	w = min(elapsed / (HZ/100), 128UL);
> 
> I did try setting HZ=10 many years ago, and the kernel blew up.
> 
> I do recall hearing of people who set HZ very low, perhaps because
> their huge machines were seeing performance problems when the timer tick
> went off.  Probably there's no need to do that any more.
> 
> But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> don't absolutely need to, and I don't think it is absolutely needed
> here.  

Fair enough.  Here is the fix. The other (HZ/10) will be addressed by
another patch that increases it to MAX_PAUSE=max(HZ/5, 1).

Thanks,
Fengguang
---

Subject: writeback: prevent divide error on tiny HZ
Date: Thu Nov 18 12:19:56 CST 2010

As suggested by Andrew and Peter:

I do recall hearing of people who set HZ very low, perhaps because their
huge machines were seeing performance problems when the timer tick went
off.  Probably there's no need to do that any more.

But still, we shouldn't hard-wire the (HZ >= 100) assumption if we don't
absolutely need to, and I don't think it is absolutely needed here.

People who do cpu bring-up on very slow FPGAs also lower HZ as far as
possible.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-18 12:35:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-18 12:35:38.000000000 +0800
@@ -490,6 +490,7 @@ void bdi_update_write_bandwidth(struct b
 				unsigned long *bw_time,
 				s64 *bw_written)
 {
+	const unsigned long unit_time = max(HZ/100, 1);
 	unsigned long written;
 	unsigned long elapsed;
 	unsigned long bw;
@@ -499,7 +500,7 @@ void bdi_update_write_bandwidth(struct b
 		goto snapshot;
 
 	elapsed = jiffies - *bw_time;
-	if (elapsed < HZ/100)
+	if (elapsed < unit_time)
 		return;
 
 	/*
@@ -513,7 +514,7 @@ void bdi_update_write_bandwidth(struct b
 
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
 	bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
-	w = min(elapsed / (HZ/100), 128UL);
+	w = min(elapsed / unit_time, 128UL);
 	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
 	bdi->write_bandwidth_update_time = jiffies;
 snapshot:
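
(For reference, a user-space sketch of the same update rule, with made-up
plain-C parameters standing in for the bdi fields: the write bandwidth is
a weighted moving average whose weight w grows with the elapsed interval,
capped at 128 out of 1024 per update, while unit_time >= 1 avoids the
divide error on HZ < 100 configs.)

	/* illustrative only; HZ=10 is the problematic config where HZ/100 == 0 */
	#define HZ 10
	static unsigned long update_bw(unsigned long old_bw, unsigned long bw,
				       unsigned long elapsed)
	{
		const unsigned long unit_time = (HZ / 100) ? (HZ / 100) : 1; /* max(HZ/100, 1) */
		unsigned long w = elapsed / unit_time;

		if (w > 128)
			w = 128;
		/* at most 1/8th of the new sample enters the average per update */
		return (old_bw * (1024 - w) + bw * w) >> 10;
	}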

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  3:34         ` Andrew Morton
@ 2010-11-18  7:27           ` Dave Chinner
  2010-11-18  7:33             ` Andrew Morton
  0 siblings, 1 reply; 71+ messages in thread
From: Dave Chinner @ 2010-11-18  7:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 17, 2010 at 07:34:31PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 14:21:41 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > > But mainly because we're taking the work accounting away from the user
> > > who caused it and crediting it to the kernel thread instead, and that's
> > > an actively *bad* thing to do.
> > 
> > The current foreground writeback is doing work on behalf of the
> > system (i.e. doing background writeback) and therefore crediting it
> > to the user process. That seems wrong to me; it's hiding the
> > overhead of system tasks in user processes.
> > 
> > IMO, time spent doing background writeback should not be credited
> > to user processes - writeback caching is a function of the OS and
> > its overhead should be accounted as such.
> 
> bah, that's bunk.  Using this logic, _no_ time spent in the kernel
> should be accounted to the user process and we may as well do away with
> system-time accounting altogether.

That's a rather extreme interpretation and not what I meant at all.
:/

> If userspace performs some action which causes the kernel to consume
> CPU resources, that consumption should be accounted to that process.

Which is pretty much impossible for work deferred to background
kernel threads. On a vanilla kernel (without this series), the CPU
dd consumes on ext4 is (output from top):

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
9875 dave      20   0 10708 1604  516 R   74  0.0   0:05.58 dd
9876 root      20   0     0    0    0 D   30  0.0   0:01.76 flush-253:16
 561 root      20   0     0    0    0 R   17  0.0  21:45.06 kswapd0
8820 root      20   0     0    0    0 S   10  0.0  15:58.61 jbd2/vdb-8

The dd is consuming 75% cpu time, all in system, including
foreground writeback.  We've got 30% being consumed by the bdi
flusher doing background writeback.  We've got 17% consumed by
kswapd reclaiming memory. And finally, 10% is consumed by a jbd2
thread.  So, all up, the dd is triggering ~130% CPU usage, but time
only reports:

$ /usr/bin/time dd if=/dev/zero of=/mnt/scratch/test1 bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17.8536 s, 587 MB/s
0.00user 12.11system 0:17.91elapsed 67%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (11major+506minor)pagefaults 0swaps

67% CPU usage for dd. IOWs, half of the CPU time associated with a
dd write is already accounted to kernel threads in a current kernel.

> Yes, writeback can be inaccurate because process A will write back
> process B's stuff, but that should even out on average, and it's more
> accurate than saying "zero".

Sure, but it still doesn't account for the flusher, jbd or kswapd
CPU usage that is still being chewed up. That's still missing from
'time dd'.

> > Indeed, nobody has
> > realised (until now) just how inefficient it really is because of
> > the fact that the overhead is mostly hidden in user process system
> > time.
> 
> "hidden"?  You do "time dd" and look at the output!
> 
> _now_ it's hidden.  You do "time dd" and whee, no system time!

What I meant is that the cost of foreground writeback was hidden in
the process system time. Now we have separated the two of them, we
can see exactly how much it was costing us because it is no longer
hidden inside the process system time.

Besides, there's plenty of system time still accounted to the dd.
It's now just the CPU time spent writing data into the page cache,
rather than write + writeback CPU time.

> You
> need to do complex gymnastics with kernel thread accounting to work out
> the real cost of your dd.

Yup, that's what we've been doing for years. ;) e.g. from the high
bandwidth IO paper I presented at OLS 2006, section 5.3 "kswapd and
pdflush":

	"While running single threaded tests, it was clear
	that there was something running in the back-
	ground that was using more CPU time than the
	writer process and pdflush combined. A sin-
	gle threaded read from disk consuming a single
	CPU was consuming 10-15% of a CPU on each
	node running memory reclaim via kswapd. For
	a single threaded write, this was closer to 30%
	of a CPU per node. On our twelve node ma-
	chine, this meant that we were using between
	1.5 and 3.5 CPUs to reclaim memory being al-
	located by a single CPU."

(http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  7:27           ` Dave Chinner
@ 2010-11-18  7:33             ` Andrew Morton
  2010-11-19  3:11               ` Dave Chinner
  0 siblings, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2010-11-18  7:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, 18 Nov 2010 18:27:06 +1100 Dave Chinner <david@fromorbit.com> wrote:

> > > Indeed, nobody has
> > > realised (until now) just how inefficient it really is because of
> > > the fact that the overhead is mostly hidden in user process system
> > > time.
> > 
> > "hidden"?  You do "time dd" and look at the output!
> > 
> > _now_ it's hidden.  You do "time dd" and whee, no system time!
> 
> What I meant is that the cost of foreground writeback was hidden in
> the process system time. Now we have separated the two of them, we
> can see exactly how much it was costing us because it is no longer
> hidden inside the process system time.

About a billion years ago I wrote the "cyclesoak" thingy which measures
CPU utilisation the other way around: run a lowest-priority process on
each CPU in the background, while running your workload, then find out
how much CPU time cyclesoak *didn't* consume.  That way you account for
everything: user time, system time, kernel threads, interrupts,
softirqs, etc.  It turned out to be pretty accurate, despite the
then-absence of SCHED_IDLE.
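
(A minimal sketch of that idea, not the original tool: pin one SCHED_IDLE
spinner per CPU, count loop iterations over a fixed window on an idle
system, then again while the workload runs; the drop in the count is the
CPU that the workload and everything it triggered actually consumed.
All names below are made up for illustration; build with -pthread.)

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static volatile unsigned long long *counts;

	static void *soak(void *arg)
	{
		int cpu = (int)(long)arg;
		struct sched_param sp = { .sched_priority = 0 };
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
		sched_setscheduler(0, SCHED_IDLE, &sp); /* run only when the CPU is otherwise idle */

		for (;;)
			counts[cpu]++;
		return NULL;
	}

	int main(void)
	{
		int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
		unsigned long long total = 0;
		pthread_t tid;

		counts = calloc(ncpus, sizeof(*counts));
		for (i = 0; i < ncpus; i++)
			pthread_create(&tid, NULL, soak, (void *)(long)i);

		sleep(10);                              /* sample window */
		for (i = 0; i < ncpus; i++)
			total += counts[i];
		/* compare this figure between an idle run and a run under load */
		printf("soaked %llu loops in 10s across %d CPUs\n", total, ncpus);
		return 0;
	}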


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-18 13:04   ` Peter Zijlstra
  2010-11-18 13:26     ` Wu Fengguang
       [not found]     ` <20101129151719.GA30590@localhost>
  2 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-18 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than  10ms, which burns CPU power)
> - avoid too large pause time (more than 100ms, which hurts responsiveness)
> - avoid big fluctuations of pause times 

If you feel like playing with sub-jiffies timeouts (a way to avoid that
HZ=>100 assumption), the below (totally untested) patch might be of
help..


---
Subject: hrtimer: Provide io_schedule_timeout*() functions

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/hrtimer.h |    7 +++++++
 kernel/hrtimer.c        |   15 +++++++++++++++
 kernel/sched.c          |   17 +++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index dd9954b..9e0f67e 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
 extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 				 struct task_struct *tsk);
 
+extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+						const enum hrtimer_mode mode);
+extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
+		unsigned long delta, const enum hrtimer_mode mode, int clock);
+extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
+
+
 extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 						const enum hrtimer_mode mode);
 extern int schedule_hrtimeout_range_clock(ktime_t *expires,
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..ef2d93c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 }
 EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
 
+int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+				     const enum hrtimer_mode mode)
+{
+	return io_schedule_hrtimeout_range_clock(expires, delta, mode,
+					      CLOCK_MONOTONIC);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
+
 /**
  * schedule_hrtimeout - sleep until timeout
  * @expires:	timeout value (ktime_t)
@@ -1866,3 +1874,10 @@ int __sched schedule_hrtimeout(ktime_t *expires,
 	return schedule_hrtimeout_range(expires, 0, mode);
 }
 EXPORT_SYMBOL_GPL(schedule_hrtimeout);
+
+int __sched io_schedule_hrtimeout(ktime_t *expires,
+			       const enum hrtimer_mode mode)
+{
+	return io_schedule_hrtimeout_range(expires, 0, mode);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout);
diff --git a/kernel/sched.c b/kernel/sched.c
index d5564a8..ac84455 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5303,6 +5303,23 @@ long __sched io_schedule_timeout(long timeout)
 	return ret;
 }
 
+int __sched
+io_schedule_hrtimeout_range_clock(ktime_t *expires, unsigned long delta,
+			       const enum hrtimer_mode mode, int clock)
+{
+	struct rq *rq = raw_rq();
+	long ret;
+
+	delayacct_blkio_start();
+	atomic_inc(&rq->nr_iowait);
+	current->in_iowait = 1;
+	ret = schedule_hrtimeout_range_clock(expires, delta, mode, clock);
+	current->in_iowait = 0;
+	atomic_dec(&rq->nr_iowait);
+	delayacct_blkio_end();
+	return ret;
+}
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:04   ` Peter Zijlstra
@ 2010-11-18 13:26     ` Wu Fengguang
  2010-11-18 13:40       ` Peter Zijlstra
       [not found]     ` <20101129151719.GA30590@localhost>
  1 sibling, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-18 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than  10ms, which burns CPU power)
> > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times 
> 
> If you feel like playing with sub-jiffies timeouts (a way to avoid that
> HZ=>100 assumption), the below (totally untested) patch might be of
> help..

Assuming there are HZ=10 users.

- when choosing such a coarse granularity, do they really care about
  responsiveness? :)

- will the use of hrtimer add a little code size and/or runtime
  overheads, and hence hurt the majority HZ=100 users?

Thanks,
Fengguang

> 
> ---
> Subject: hrtimer: Provide io_schedule_timeout*() functions
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/hrtimer.h |    7 +++++++
>  kernel/hrtimer.c        |   15 +++++++++++++++
>  kernel/sched.c          |   17 +++++++++++++++++
>  3 files changed, 39 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index dd9954b..9e0f67e 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
>  extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  				 struct task_struct *tsk);
>  
> +extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
> +						const enum hrtimer_mode mode);
> +extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
> +		unsigned long delta, const enum hrtimer_mode mode, int clock);
> +extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
> +
> +
>  extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
>  						const enum hrtimer_mode mode);
>  extern int schedule_hrtimeout_range_clock(ktime_t *expires,
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 72206cf..ef2d93c 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
>  }
>  EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
>  
> +int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
> +				     const enum hrtimer_mode mode)
> +{
> +	return io_schedule_hrtimeout_range_clock(expires, delta, mode,
> +					      CLOCK_MONOTONIC);
> +}
> +EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
> +
>  /**
>   * schedule_hrtimeout - sleep until timeout
>   * @expires:	timeout value (ktime_t)
> @@ -1866,3 +1874,10 @@ int __sched schedule_hrtimeout(ktime_t *expires,
>  	return schedule_hrtimeout_range(expires, 0, mode);
>  }
>  EXPORT_SYMBOL_GPL(schedule_hrtimeout);
> +
> +int __sched io_schedule_hrtimeout(ktime_t *expires,
> +			       const enum hrtimer_mode mode)
> +{
> +	return io_schedule_hrtimeout_range(expires, 0, mode);
> +}
> +EXPORT_SYMBOL_GPL(io_schedule_hrtimeout);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d5564a8..ac84455 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5303,6 +5303,23 @@ long __sched io_schedule_timeout(long timeout)
>  	return ret;
>  }
>  
> +int __sched
> +io_schedule_hrtimeout_range_clock(ktime_t *expires, unsigned long delta,
> +			       const enum hrtimer_mode mode, int clock)
> +{
> +	struct rq *rq = raw_rq();
> +	long ret;
> +
> +	delayacct_blkio_start();
> +	atomic_inc(&rq->nr_iowait);
> +	current->in_iowait = 1;
> +	ret = schedule_hrtimeout_range_clock(expires, delta, mode, clock);
> +	current->in_iowait = 0;
> +	atomic_dec(&rq->nr_iowait);
> +	delayacct_blkio_end();
> +	return ret;
> +}
> +
>  /**
>   * sys_sched_get_priority_max - return maximum RT priority.
>   * @policy: scheduling class.
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:26     ` Wu Fengguang
@ 2010-11-18 13:40       ` Peter Zijlstra
  2010-11-18 14:02         ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-18 13:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, 2010-11-18 at 21:26 +0800, Wu Fengguang wrote:
> On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > > - avoid too small pause time (less than  10ms, which burns CPU power)
> > > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > > - avoid big fluctuations of pause times 
> > 
> > If you feel like playing with sub-jiffies timeouts (a way to avoid that
> > HZ=>100 assumption), the below (totally untested) patch might be of
> > help..
> 
> Assuming there are HZ=10 users.
> 
> - when choosing such a coarse granularity, do they really care about
>   responsiveness? :)

No, of course not - they usually care about booting their system... I've
been told booting Linux on a 10MHz FPGA is 'fun' :-)

> - will the use of hrtimer add a little code size and/or runtime
>   overheads, and hence hurt the majority HZ=100 users?

Yes it will add code and runtime overhead, but it would allow you to
have 1ms timeouts even on a HZ=100 system, as opposed to a 10ms minimum.

Anyway, I'm not saying you should do it, I just wondered if we had the
API, saw we didn't and thought it might be nice to offer it if desired.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:40       ` Peter Zijlstra
@ 2010-11-18 14:02         ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-18 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, Nov 18, 2010 at 09:40:06PM +0800, Peter Zijlstra wrote:
> On Thu, 2010-11-18 at 21:26 +0800, Wu Fengguang wrote:
> > On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > > > - avoid too small pause time (less than  10ms, which burns CPU power)
> > > > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > > > - avoid big fluctuations of pause times 
> > > 
> > > If you feel like playing with sub-jiffies timeouts (a way to avoid that
> > > HZ=>100 assumption), the below (totally untested) patch might be of
> > > help..
> > 
> > Assuming there are HZ=10 users.
> > 
> > - when choosing such a coarse granularity, do they really care about
> >   responsiveness? :)
> 
> No, of course not - they usually care about booting their system... I've
> been told booting Linux on a 10MHz FPGA is 'fun' :-)

Wow, it's amazing Linux can run on it at all :)

> > - will the use of hrtimer add a little code size and/or runtime
> >   overheads, and hence hurt the majority HZ=100 users?
> 
> Yes it will add code and runtime overhead, but it would allow you to
> have 1ms timeouts even on a HZ=100 system, as opposed to a 10ms minimum.

Yeah, Dave Chinner once pointed out that a 1ms sleep may be desirable on
really fast storage. That may help if there is only one really fast
dirtier. Let's see whether such user demand shows up.

But for now, amusingly, the demand is for 100-200ms pause times to
reduce CPU overheads when there are hundreds of concurrent dirtiers.
The number is pretty easy to tune in itself, but the downside is much
bigger fluctuations. So I'm now trying ways to keep them under
control.

> Anyway, I'm not saying you should do it, I just wondered if we had the
> API, saw we didn't and thought it might be nice to offer it if desired.

Thanks for the offer. We can surely do it when some loud user
complaint comes along :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  7:33             ` Andrew Morton
@ 2010-11-19  3:11               ` Dave Chinner
  0 siblings, 0 replies; 71+ messages in thread
From: Dave Chinner @ 2010-11-19  3:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
	Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 17, 2010 at 11:33:50PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 18:27:06 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > > > Indeed, nobody has
> > > > realised (until now) just how inefficient it really is because of
> > > > the fact that the overhead is mostly hidden in user process system
> > > > time.
> > > 
> > > "hidden"?  You do "time dd" and look at the output!
> > > 
> > > _now_ it's hidden.  You do "time dd" and whee, no system time!
> > 
> > What I meant is that the cost of foreground writeback was hidden in
> > the process system time. Now we have separated the two of them, we
> > can see exactly how much it was costing us because it is no longer
> > hidden inside the process system time.
> 
> About a billion years ago I wrote the "cyclesoak" thingy which measures
> CPU utilisation the other way around: run a lowest-priority process on
> each CPU in the background, while running your workload, then find out
> how much CPU time cyclesoak *didn't* consume.  That way you account for
> everything: user time, system time, kernel threads, interrupts,
> softirqs, etc.  It turned out to be pretty accurate, despite the
> then-absence of SCHED_IDLE.

Yeah, I just use PCP to tell me what the CPU usage is in a nice
graph. The link below is an image of the "overview" monitoring tab I
have - total CPU, IOPS, bandwidth, XFS directory ops and context
switches. Here's what the behaviour of an increasing number of dd's with
this series looks like:

http://userweb.kernel.org/~dgc/io-less-throttle-dd.png

Left to right, that's 1 dd, 2, 4, 8, 16 and 32 dd's, then a gap, then
the 8-way fs_mark workload running. These are all taken at a 5s
sample period.

FWIW, on the 32 thread dd (the rightmost of the set of pillars),
you can see the sudden increase in system CPU usage in the last few
samples (which corresponds to the first few dd's completing and
exiting) that I mentioned previously.

Basically, I'm always looking at the total CPU usage of a workload,
memory usage of caches, etc, in this manner.  Sure, I use stuff like
time to get numbers to drop out of test scripts, but most of my
behavioural analysis is done through observing differences between
two charts and then looking deeper to work out what changed...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17 10:34   ` Minchan Kim
@ 2010-11-22  2:01     ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-22  2:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner,
	Peter Zijlstra, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

Hi Minchan,

On Wed, Nov 17, 2010 at 06:34:26PM +0800, Minchan Kim wrote:
> Hi Wu,
> 
> As you know, I am not an expert in this area.
> So I hope my review can help other newbies like me understand and
> make this document clearer. :)
> I didn't look into the code; before that, I would like to clarify your concept.

Yeah, it's quite a big change of "concept" :)

Sorry for the late reply; I'm still tuning things and some details
may change as a result. The biggest challenge now is the stability of
the control algorithms. Everything is floating around and I'm trying
to keep the fluctuations down by borrowing some equations from
optimal control theory.

> On Wed, Nov 17, 2010 at 1:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> > inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> > time to throttle the dirtying task. In the mean while, kick off the
> > per-bdi flusher thread to do background writeback IO.
> >
> > This patch introduces the basic framework, which will be further
> > consolidated by the next patches.
> >
> > RATIONALS
> > =========
> >
> > The current balance_dirty_pages() is rather IO inefficient.
> >
> > - concurrent writeback of multiple inodes (Dave Chinner)
> >
> >  If every thread doing writes and being throttled starts foreground
> >  writeback, it leads to N IO submitters from at least N different
> >  inodes at the same time, end up with N different sets of IO being
> >  issued with potentially zero locality to each other, resulting in
> >  much lower elevator sort/merge efficiency and hence we seek the disk
> >  all over the place to service the different sets of IO.
> >  OTOH, if there is only one submission thread, it doesn't jump between
> >  inodes in the same way when congestion clears - it keeps writing to
> >  the same inode, resulting in large related chunks of sequential IOs
> >  being issued to the disk. This is more efficient than the above
> >  foreground writeback because the elevator works better and the disk
> >  seeks less.
> >
> > - IO size too small for fast arrays and too large for slow USB sticks
> >
> >  The write_chunk used by current balance_dirty_pages() cannot be
> >  directly set to some large value (eg. 128MB) for better IO efficiency.
> >  Because it could lead to more than 1 second user perceivable stalls.
> >  Even the current 4MB write size may be too large for slow USB sticks.
> >  The fact that balance_dirty_pages() starts IO on itself couples the
> >  IO size to wait time, which makes it hard to do suitable IO size while
> >  keeping the wait time under control.
> >
> > For the above two reasons, it's much better to shift IO to the flusher
> > threads and let balance_dirty_pages() just wait for enough time or progress.
> >
> > Jan Kara, Dave Chinner and I explored the scheme to let
> > balance_dirty_pages() wait for enough writeback IO completions to
> > safeguard the dirty limit. However it's found to have two problems:
> >
> > - in large NUMA systems, the per-cpu counters may have big accounting
> >  errors, leading to big throttle wait time and jitters.
> >
> > - NFS may kill a large amount of unstable pages with one single COMMIT.
> >  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
> >  desirable to delay and reduce the number of COMMITs. So it's not
> >  likely to optimize away such bursty IO completions, and the
> >  resulting large (and tiny) stall times in IO completion based throttling.
> >
> > So here is a pause time oriented approach, which tries to control the
> > pause time in each balance_dirty_pages() invocation, by controlling
> > the number of pages dirtied before calling balance_dirty_pages(), for
> > smooth and efficient dirty throttling:
> >
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than  10ms, which burns CPU power)
> > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times
> >
> > For example, when doing a simple cp on ext4 with mem=4G HZ=250.
> >
> > before patch, the pause time fluctuates from 0 to 324ms
> > (and the stall time may grow very large for slow devices)
> >
> > [ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
> > [ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
> > [ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> >
> > after patch, the pause time remains stable around 32ms
> >
> > cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
> >
> > CONTROL SYSTEM
> > ==============
> >
> > The current task_dirty_limit() adjusts bdi_dirty_limit to get
> > task_dirty_limit according to the dirty "weight" of the current task,
> > which is the percent of pages recently dirtied by the task. If 100%
> > pages are recently dirtied by the task, it will lower bdi_dirty_limit by
> > 1/8. If only 1% pages are dirtied by the task, it will return almost
> > unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
> > blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
> > allowing a light dirtier to progress (the latter won't be blocked
> > because R << B in fig.1).
> >
> > Fig.1 before patch, a heavy dirtier and a light dirtier
> >                                                R
> > ----------------------------------------------+-o---------------------------*--|
> >                                              L A                           B  T
> >  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
> >  L: T - T/8
> >
> >  R: bdi_reclaimable + bdi_writeback
> >
> >  A: task_dirty_limit for a heavy dirtier ~= R ~= L
> >  B: task_dirty_limit for a light dirtier ~= T
> >
> > Since each process has its own dirty limit, we reuse A/B for the tasks as
> > well as their dirty limits.
> >
> > If B is a newly started heavy dirtier, then it will slowly gain weight
> > and A will lose weight.  The task_dirty_limit for A and B will be
> > approaching the center of region (L, T) and eventually stabilize there.
> >
> > Fig.2 before patch, two heavy dirtiers converging to the same threshold
> >                                                             R
> > ----------------------------------------------+--------------o-*---------------|
> >                                              L              A B               T
> 
> Seems good so far.
> So, what's the problem if two heavy dirtiers have the same threshold?

That's not a problem. It's the proper behavior for two "dd"s to
converge.

> > Fig.3 after patch, one heavy dirtier
> >                                                |
> >    throttle_bandwidth ~= bdi_bandwidth  =>     o
> >                                                | o
> >                                                |   o
> >                                                |     o
> >                                                |       o
> >                                                |         o
> >                                              La|           o
> > ----------------------------------------------+-+-------------o----------------|
> >                                                R             A                T
> >  T: bdi_dirty_limit
> >  A: task_dirty_limit      = T - Wa * T/16
> >  La: task_throttle_thresh = A - A/16
> >
> >  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
> >
> > Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> > way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
> > this region, the task may be throttled for J jiffies on every N pages it dirtied.
> > Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:
> >
> >        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
> > where
> >        A = T - Wa * T/16
> >        La = A - A/16
> > where Wa is task weight for A. It's 0 for very light dirtier and 1 for
> > the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
> > task weight will be updated independently by task_dirty_inc() at
> > set_page_dirty() time.
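
(A minimal user-space sketch of the above formula, with made-up plain
variables standing in for the real bdi/task fields:)

	/* illustrative only: evaluate the soft-region throttle bandwidth */
	double throttle_bw(double bdi_bw, double T, double R, double Wa)
	{
		double A  = T - Wa * T / 16;    /* task_dirty_limit              */
		double La = A - A / 16;         /* lower edge of the soft region */

		if (R >= A)     /* (A - R) went negative: pause 100ms and retry */
			return 0;
		return bdi_bw * (A - R) / (A - La);     /* >= bdi_bw while R <= La */
	}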
> 
> 
> Dumb question.
> 
> I can't see the difference between old and new,
> La depends on A.
> A depends on Wa.
> T is constant?

T is the bdi's share of the global dirty limit. It's stable in the normal
case, and here we use it as the reference point for per-bdi dirty throttling.

> Then, throttle_bandwidth depends on Wa.

Sure, each task will be throttled at a different bandwidth if their
"Wa" are different.

> Wa depends on the number of dirtied pages during some interval.
> So if a light dirtier becomes heavy, eventually the light dirtier and the
> heavy dirtier will have the same weight.
> That means throttle_bandwidth is the same - the same result as before.

Yeah. Wa and throttle_bandwidth are changing over time.
 
> Please, open my eyes. :)

You get the dynamics right :)

> Thanks for the great work.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
  2010-11-17 14:39   ` Wu Fengguang
@ 2010-11-24 10:23   ` Peter Zijlstra
  2010-11-24 10:43     ` Wu Fengguang
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 10:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> +       if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
> +                    bdi->dirty_exceeded)) {
> +               balance_dirty_pages(mapping, current->nr_dirtied);
> +               current->nr_dirtied = 0;
>         } 

Was it a conscious choice to use
  current->nr_dirtied = 0
over 
  current->nr_dirtied -= current->nr_dirtied_pause
?

The former will cause a drift in pause times due to truncation of the
excess.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/13] writeback: account per-bdi accumulated written pages
  2010-11-17  4:27 ` [PATCH 05/13] writeback: account per-bdi accumulated written pages Wu Fengguang
@ 2010-11-24 10:26   ` Peter Zijlstra
  2010-11-24 10:44     ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 10:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> @@ -1292,6 +1292,7 @@ int test_clear_page_writeback(struct pag
>                                                 PAGECACHE_TAG_WRITEBACK);
>                         if (bdi_cap_account_writeback(bdi)) {
>                                 __dec_bdi_stat(bdi, BDI_WRITEBACK);
> +                               __inc_bdi_stat(bdi, BDI_WRITTEN);
>                                 __bdi_writeout_inc(bdi);
>                         }
>                 } 

Shouldn't that live in __bdi_writeout_inc()? It looks like this forgets
about fuse (fuse_writepage_finish() -> bdi_writeout_inc() ->
__bdi_writeout_inc()).
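
A minimal sketch of the suggested move, assuming the existing
__bdi_writeout_inc() helper in mm/page-writeback.c (illustrative, not an
actual patch):

static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
{
	/* account every completed writeout here, so fuse's
	 * bdi_writeout_inc() path is covered as well */
	__inc_bdi_stat(bdi, BDI_WRITTEN);
	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
			      bdi->max_prop_frac);
}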

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages()
  2010-11-24 10:23   ` Peter Zijlstra
@ 2010-11-24 10:43     ` Wu Fengguang
  2010-11-24 10:49       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 06:23:07PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > +       if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
> > +                    bdi->dirty_exceeded)) {
> > +               balance_dirty_pages(mapping, current->nr_dirtied);
> > +               current->nr_dirtied = 0;
> >         } 
> 
> Was it a conscious choice to use
>   current->nr_dirtied = 0
> over 
>   current->nr_dirtied -= current->nr_dirtied_pause
> ?
> 
> The former will cause a drift in pause times due to truncation of the
> excess.

It should be fine either way, as long as the "truncated" number is
passed to balance_dirty_pages():

+               balance_dirty_pages(mapping, current->nr_dirtied);
+               current->nr_dirtied = 0;

or

+               balance_dirty_pages(mapping, current->nr_dirtied_pause);
+               current->nr_dirtied -= current->nr_dirtied_pause;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/13] writeback: account per-bdi accumulated written pages
  2010-11-24 10:26   ` Peter Zijlstra
@ 2010-11-24 10:44     ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 06:26:16PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > @@ -1292,6 +1292,7 @@ int test_clear_page_writeback(struct pag
> >                                                 PAGECACHE_TAG_WRITEBACK);
> >                         if (bdi_cap_account_writeback(bdi)) {
> >                                 __dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +                               __inc_bdi_stat(bdi, BDI_WRITTEN);
> >                                 __bdi_writeout_inc(bdi);
> >                         }
> >                 } 
> 
> Shouldn't that live in __bdi_writeout_inc()? It looks like this forgets
> about fuse (fuse_writepage_finish() -> bdi_writeout_inc() ->
> __bdi_writeout_inc()).

Good catch! Will fix it in next post.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages()
  2010-11-24 10:43     ` Wu Fengguang
@ 2010-11-24 10:49       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 10:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 18:43 +0800, Wu Fengguang wrote:
> On Wed, Nov 24, 2010 at 06:23:07PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > > +       if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
> > > +                    bdi->dirty_exceeded)) {
> > > +               balance_dirty_pages(mapping, current->nr_dirtied);
> > > +               current->nr_dirtied = 0;
> > >         } 
> > 
> > Was it a conscious choice to use
> >   current->nr_dirtied = 0
> > over 
> >   current->nr_dirtied -= current->nr_dirtied_pause
> > ?
> > 
> > The former will cause a drift in pause times due to truncation of the
> > excess.
> 
> It should be fine either way, as long as the "truncated" number is
> passed to balance_dirty_pages():
> 
> +               balance_dirty_pages(mapping, current->nr_dirtied);
> +               current->nr_dirtied = 0;
> 
> or
> 
> +               balance_dirty_pages(mapping, current->nr_dirtied_pause);
> +               current->nr_dirtied -= current->nr_dirtied_pause;

ok, just wanted to make sure you'd considered it.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-24 10:58   ` Peter Zijlstra
  2010-11-24 14:06     ` Wu Fengguang
  2010-11-24 11:05   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 10:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:

> @@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
>  		pause = clamp_val(pause, 1, HZ/10);
>  
>  pause:
> +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  		io_schedule_timeout(pause);
> +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
>  
>  		/*
>  		 * The bdi thresh is somehow "soft" limit derived from the

So its really a two part bandwidth calculation, the first call is:

  bdi_get_bandwidth()

and the second call is:

  bdi_update_bandwidth()

Would it make sense to actually implement it with two functions instead
of overloading the functionality of the one function?
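
Roughly like the following, with the update math left as in the original
patch (names and signatures here are only a sketch of the idea):

/* take the initial snapshot of time and the BDI_WRITTEN counter */
void bdi_get_bandwidth(struct backing_dev_info *bdi,
		       unsigned long *bw_time, s64 *bw_written)
{
	*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
	*bw_time = jiffies;
}

/* fold the pages written since the snapshot into bdi->write_bandwidth */
void bdi_update_bandwidth(struct backing_dev_info *bdi,
			  unsigned long *bw_time, s64 *bw_written)
{
	unsigned long elapsed = jiffies - *bw_time;
	unsigned long written;

	if (elapsed < HZ/100)	/* need at least 10ms worth of data */
		return;
	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
	/* ... same rate limiting and weighted average as in the patch ... */
}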

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
  2010-11-24 10:58   ` Peter Zijlstra
@ 2010-11-24 11:05   ` Peter Zijlstra
  2010-11-24 12:10     ` Wu Fengguang
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 11:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> +void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> +                               unsigned long *bw_time,
> +                               s64 *bw_written)
> +{
> +       unsigned long written;
> +       unsigned long elapsed;
> +       unsigned long bw;
> +       unsigned long w;
> +
> +       if (*bw_written == 0)
> +               goto snapshot;
> +
> +       elapsed = jiffies - *bw_time;
> +       if (elapsed < HZ/100)
> +               return;
> +
> +       /*
> +        * When there lots of tasks throttled in balance_dirty_pages(), they
> +        * will each try to update the bandwidth for the same period, making
> +        * the bandwidth drift much faster than the desired rate (as in the
> +        * single dirtier case). So do some rate limiting.
> +        */
> +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> +               goto snapshot;

Why this goto snapshot and not simply return? This is the second call
(bdi_update_bandwidth equivalent).

If you were to leave the old bw_written/bw_time in place the next loop
around in wb_writeback() would see a larger delta..

I guess this funny loop in wb_writeback() is also the reason you've got
a single function and not the get/update like separation

> +       written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
> +       bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
> +       w = min(elapsed / (HZ/100), 128UL);
> +       bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
> +       bdi->write_bandwidth_update_time = jiffies;
> +snapshot:
> +       *bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> +       *bw_time = jiffies;
> +} 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] IO-less dirty throttling v2
  2010-11-18  2:09     ` Andrew Morton
  2010-11-18  3:21       ` Dave Chinner
@ 2010-11-24 11:12       ` Avi Kivity
  1 sibling, 0 replies; 71+ messages in thread
From: Avi Kivity @ 2010-11-24 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On 11/18/2010 04:09 AM, Andrew Morton wrote:
> But mainly because we're taking the work accounting away from the user
> who caused it and crediting it to the kernel thread instead, and that's
> an actively *bad* thing to do.
>

That's happening more and more with workqueues and kernel threads.

We need the ability for a kernel thread (perhaps a workqueue thread) to 
say "I am doing this on behalf of thread X, please charge any costs I 
incur (faults, cpu time, whatever) to that thread".

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low
  2010-11-17  4:27 ` [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
@ 2010-11-24 11:13   ` Peter Zijlstra
  2010-11-24 12:30     ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 11:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:

> @@ -578,6 +579,25 @@ static void balance_dirty_pages(struct a
>  				    bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
> +		/*
> +		 * bdi_thresh takes time to ramp up from the initial 0,
> +		 * especially for slow devices.
> +		 *
> +		 * It's possible that at the moment dirty throttling starts,
> +		 * 	bdi_dirty = nr_dirty
> +		 * 		  = (background_thresh + dirty_thresh) / 2
> +		 * 		  >> bdi_thresh
> +		 * Then the task could be blocked for a dozen second to flush
> +		 * all the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
> +		 * complementary way to break out of the loop when 250ms worth
> +		 * of dirty pages have been cleaned during our pause time.
> +		 */
> +		if (nr_dirty < dirty_thresh &&
> +		    bdi_prev_dirty - bdi_dirty >
> +		    bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2))
> +			break;
> +		bdi_prev_dirty = bdi_dirty;
> +
>  		if (bdi_dirty >= bdi_thresh) {
>  			pause = HZ/10;
>  			goto pause;


So we're testing to see if during our pause time (<=100ms) we've written
out 250ms worth of pages (given our current bandwidth estimation),
right? 

(1/4th of bandwidth in bytes/s is bytes per 0.25s) 

(and in your recent patches you've changed the bw to pages/s so I take
it the PAGE_CACHE_SIZE will be gone from all these sites).
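
For instance (illustrative numbers): with 4KB pages (PAGE_CACHE_SHIFT == 12)
and write_bandwidth still kept in bytes per second, a 60MB/s device gives

	(60 << 20) >> (PAGE_CACHE_SHIFT + 2) = 62914560 >> 14 = 3840 pages

and 3840 pages * 4KB = 15MB, i.e. exactly what a 60MB/s device writes back
in 0.25s, which is what the cleaned pages are compared against.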



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time
  2010-11-17  4:27 ` [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
@ 2010-11-24 11:15   ` Peter Zijlstra
  2010-11-24 12:39     ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 11:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Richard Kennedy, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> plain text document attachment
> (writeback-speedup-per-bdi-threshold-ramp-up.patch)
> Reduce the dampening for the control system, yielding faster
> convergence.
> 
> Currently it converges at a snail's pace for slow devices (in order of
> minutes).  For really fast storage, the convergence speed should be fine.
> 
> It makes sense to make it reasonably fast for typical desktops.
> 
> After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> 16GB mem, which seems reasonable.
> 
> $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> BdiDirtyThresh:            0 kB
> BdiDirtyThresh:       118748 kB
> BdiDirtyThresh:       214280 kB
> BdiDirtyThresh:       303868 kB
> BdiDirtyThresh:       376528 kB
> BdiDirtyThresh:       411180 kB
> BdiDirtyThresh:       448636 kB
> BdiDirtyThresh:       472260 kB
> BdiDirtyThresh:       490924 kB
> BdiDirtyThresh:       499596 kB
> BdiDirtyThresh:       507068 kB
> ...
> DirtyThresh:          530392 kB
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Richard Kennedy <richard@rsk.demon.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-11-15 13:08:16.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-11-15 13:08:28.000000000 +0800
> @@ -125,7 +125,7 @@ static int calc_period_shift(void)
>  	else
>  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
>  				100;
> -	return 2 + ilog2(dirty_total - 1);
> +	return ilog2(dirty_total - 1) - 1;
>  }
>  
>  /*

You could actually improve upon this now that you have per-bdi bandwidth
estimations, simply set the period to (seconds * bandwidth) to get
convergence in @seconds.
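
Something along these lines, purely as a sketch of the idea (the
CONVERGE_SECS constant and the total_write_bandwidth() helper are made up
here; the latter would sum the per-bdi bandwidth estimates in pages/s):

#define CONVERGE_SECS	3

static int calc_period_shift(void)
{
	unsigned long period = CONVERGE_SECS * total_write_bandwidth();

	return ilog2(period - 1);
}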



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds
  2010-11-17  4:27 ` [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
@ 2010-11-24 11:18   ` Peter Zijlstra
  2010-11-24 12:48     ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 11:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> plain text document attachment
> (writeback-fix-oversize-background-thresh.patch)
> The change is virtually a no-op for the majority of users that use the
> default 10/20 background/dirty ratios. For the others, it is not clear why
> they set the background ratio so close to the dirty ratio. Some may set the
> background ratio equal to the dirty ratio, but no one seems to notice or
> complain that it's then silently halved under the hood..
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-11-15 13:12:50.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-11-15 13:13:42.000000000 +0800
> @@ -403,8 +403,15 @@ void global_dirty_limits(unsigned long *
>  	else
>  		background = (dirty_background_ratio * available_memory) / 100;
>  
> -	if (background >= dirty)
> -		background = dirty / 2;
> +	/*
> +	 * Ensure at least 1/4 gap between background and dirty thresholds, so
> +	 * that when dirty throttling starts at (background + dirty)/2, it's at
> +	 * the entrance of bdi soft throttle threshold, so as to avoid being
> +	 * hard throttled.
> +	 */
> +	if (background > dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT)
> +		background = dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT;
> +
>  	tsk = current;
>  	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
>  		background += background / 4;


Hrm,.. the alternative is to return -ERANGE or somesuch when people try
to write nonsensical values.

I'm not sure what's best, guessing at what the user did mean to do or
forcing him to actually think.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 11:05   ` Peter Zijlstra
@ 2010-11-24 12:10     ` Wu Fengguang
  2010-11-24 12:50       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 07:05:32PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > +                               unsigned long *bw_time,
> > +                               s64 *bw_written)
> > +{
> > +       unsigned long written;
> > +       unsigned long elapsed;
> > +       unsigned long bw;
> > +       unsigned long w;
> > +
> > +       if (*bw_written == 0)
> > +               goto snapshot;
> > +
> > +       elapsed = jiffies - *bw_time;
> > +       if (elapsed < HZ/100)
> > +               return;
> > +
> > +       /*
> > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > +        * will each try to update the bandwidth for the same period, making
> > +        * the bandwidth drift much faster than the desired rate (as in the
> > +        * single dirtier case). So do some rate limiting.
> > +        */
> > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > +               goto snapshot;
>
> Why this goto snapshot and not simply return? This is the second call
> (bdi_update_bandwidth equivalent).

Good question. The loop inside balance_dirty_pages() normally run only
once, however wb_writeback() may loop over and over again. If we just
return here, the condition

        (jiffies - bdi->write_bandwidth_update_time < elapsed)

cannot be reset, then future bdi_update_bandwidth() calls in the same
wb_writeback() loop will never find it OK to update the bandwidth.

It does assume no races between CPUs.. We may need some per-cpu based
estimation.

> If you were to leave the old bw_written/bw_time in place the next loop
> around in wb_writeback() would see a larger delta..
>
> I guess this funny loop in wb_writeback() is also the reason you've got
> a single function and not the get/update like separation

I do the single function mainly for wb_writeback(), where it
continuously updates bandwidth inside the loop. The function can be
called in such way:

loop {
        take snapshot on first call

        no action if recalled within 10ms
        ...
        no action if recalled within 10ms
        ...

        update bandwidth and prepare for next update by taking the snapshot

        no action if recalled within 10ms
        ...
}

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low
  2010-11-24 11:13   ` Peter Zijlstra
@ 2010-11-24 12:30     ` Wu Fengguang
  2010-11-24 12:46       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 07:13:53PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> 
> > @@ -578,6 +579,25 @@ static void balance_dirty_pages(struct a
> >  				    bdi_stat(bdi, BDI_WRITEBACK);
> >  		}
> >  
> > +		/*
> > +		 * bdi_thresh takes time to ramp up from the initial 0,
> > +		 * especially for slow devices.
> > +		 *
> > +		 * It's possible that at the moment dirty throttling starts,
> > +		 * 	bdi_dirty = nr_dirty
> > +		 * 		  = (background_thresh + dirty_thresh) / 2
> > +		 * 		  >> bdi_thresh
> > +		 * Then the task could be blocked for a dozen second to flush
> > +		 * all the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
> > +		 * complementary way to break out of the loop when 250ms worth
> > +		 * of dirty pages have been cleaned during our pause time.
> > +		 */
> > +		if (nr_dirty < dirty_thresh &&
> > +		    bdi_prev_dirty - bdi_dirty >
> > +		    bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2))
> > +			break;
> > +		bdi_prev_dirty = bdi_dirty;
> > +
> >  		if (bdi_dirty >= bdi_thresh) {
> >  			pause = HZ/10;
> >  			goto pause;
> 
> 
> So we're testing to see if during our pause time (<=100ms) we've written
> out 250ms worth of pages (given our current bandwidth estimation),
> right? 
> 
> (1/4th of bandwidth in bytes/s is bytes per 0.25s) 

Right.

> (and in your recent patches you've changed the bw to pages/s so I take
> it the PAGE_CACHE_SIZE will be gone from all these sites).

Yeah. Actually I did one more fix after that. The break is designed
mainly to help the single task case. It helps less for concurrent dirtier
cases; however, for long running servers I guess they don't really care
about some boot time lag.

For the 1-dd case, it looks better to lower the break threshold to
125ms. After all, it's not easy for the dirty pages to drop by 250ms
worth of data when you only slept 200ms (note: the max pause time has
been doubled mainly for servers).

-               if (nr_dirty < dirty_thresh &&
-                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 4)
+               if (nr_dirty <= dirty_thresh &&
+                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
                        break;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time
  2010-11-24 11:15   ` Peter Zijlstra
@ 2010-11-24 12:39     ` Wu Fengguang
  2010-11-24 12:56       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Richard Kennedy, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 07:15:41PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (writeback-speedup-per-bdi-threshold-ramp-up.patch)
> > Reduce the dampening for the control system, yielding faster
> > convergence.
> > 
> > Currently it converges at a snail's pace for slow devices (in order of
> > minutes).  For really fast storage, the convergence speed should be fine.
> > 
> > It makes sense to make it reasonably fast for typical desktops.
> > 
> > After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> > So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> > 16GB mem, which seems reasonable.
> > 
> > $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> > BdiDirtyThresh:            0 kB
> > BdiDirtyThresh:       118748 kB
> > BdiDirtyThresh:       214280 kB
> > BdiDirtyThresh:       303868 kB
> > BdiDirtyThresh:       376528 kB
> > BdiDirtyThresh:       411180 kB
> > BdiDirtyThresh:       448636 kB
> > BdiDirtyThresh:       472260 kB
> > BdiDirtyThresh:       490924 kB
> > BdiDirtyThresh:       499596 kB
> > BdiDirtyThresh:       507068 kB
> > ...
> > DirtyThresh:          530392 kB
> > 
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > CC: Richard Kennedy <richard@rsk.demon.co.uk>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-11-15 13:08:16.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-11-15 13:08:28.000000000 +0800
> > @@ -125,7 +125,7 @@ static int calc_period_shift(void)
> >  	else
> >  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> >  				100;
> > -	return 2 + ilog2(dirty_total - 1);
> > +	return ilog2(dirty_total - 1) - 1;
> >  }
> >  
> >  /*
> 
> You could actually improve upon this now that you have per-bdi bandwidth
> estimations, simply set the period to (seconds * bandwidth) to get
> convergence in @seconds.

I'd like to, but there is the global vs. bdi discrepancy to be
addressed first :)

How about doing this simple fix first, and then revisit doing per-bdi
vm_dirties after the bandwidth estimation goes upstream?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low
  2010-11-24 12:30     ` Wu Fengguang
@ 2010-11-24 12:46       ` Peter Zijlstra
  2010-11-24 12:59         ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 12:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 20:30 +0800, Wu Fengguang wrote:
> 
> For the 1-dd case, it looks better to lower the break threshold to
> 125ms. After all, it's not easy for the dirty pages to drop by 250ms
> worth of data when you only slept 200ms (note: the max pause time has
> been doubled mainly for servers).
> 
> -               if (nr_dirty < dirty_thresh &&
> -                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 4)
> +               if (nr_dirty <= dirty_thresh &&
> +                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
>                         break;

Hrm, but 125ms worth in 200ms is rather easy, you'd want to keep that
limit above what the pause should give you, right?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds
  2010-11-24 11:18   ` Peter Zijlstra
@ 2010-11-24 12:48     ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 07:18:18PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (writeback-fix-oversize-background-thresh.patch)
> > The change is virtually a no-op for the majority of users that use the
> > default 10/20 background/dirty ratios. For the others, it is not clear why
> > they set the background ratio so close to the dirty ratio. Some may set the
> > background ratio equal to the dirty ratio, but no one seems to notice or
> > complain that it's then silently halved under the hood..
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-11-15 13:12:50.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-11-15 13:13:42.000000000 +0800
> > @@ -403,8 +403,15 @@ void global_dirty_limits(unsigned long *
> >  	else
> >  		background = (dirty_background_ratio * available_memory) / 100;
> >  
> > -	if (background >= dirty)
> > -		background = dirty / 2;
> > +	/*
> > +	 * Ensure at least 1/4 gap between background and dirty thresholds, so
> > +	 * that when dirty throttling starts at (background + dirty)/2, it's at
> > +	 * the entrance of bdi soft throttle threshold, so as to avoid being
> > +	 * hard throttled.
> > +	 */
> > +	if (background > dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT)
> > +		background = dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT;
> > +
> >  	tsk = current;
> >  	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> >  		background += background / 4;
> 
> 
> Hrm,.. the alternative is to return -ERANGE or somesuch when people try
> to write nonsensical values.
> 
> I'm not sure what's best, guessing at what the user did mean to do or
> forcing him to actually think.

Yes, this may break user space either way.
Doing it loudly does make more sense.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 12:10     ` Wu Fengguang
@ 2010-11-24 12:50       ` Peter Zijlstra
  2010-11-24 13:14         ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 12:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > +       /*
> > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > +        * will each try to update the bandwidth for the same period, making
> > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > +        * single dirtier case). So do some rate limiting.
> > > +        */
> > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > +               goto snapshot;
> >
> > Why this goto snapshot and not simply return? This is the second call
> > (bdi_update_bandwidth equivalent).
> 
> Good question. The loop inside balance_dirty_pages() normally run only
> once, however wb_writeback() may loop over and over again. If we just
> return here, the condition
> 
>         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> 
> cannot be reset, then future bdi_update_bandwidth() calls in the same
> wb_writeback() loop will never find it OK to update the bandwidth.

But the thing is, you don't want to reset that, it might loop so fast
you'll throttle all of them, if you keep the pre-throttle value you'll
eventually pass, no?

> It does assume no races between CPUs.. We may need some per-cpu based
> estimation. 

But that multi-writer race is valid even for the balance_dirty_pages()
call, two or more could interleave on the bw_time and bw_written
variables.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time
  2010-11-24 12:39     ` Wu Fengguang
@ 2010-11-24 12:56       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 12:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Richard Kennedy, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 20:39 +0800, Wu Fengguang wrote:
> > > @@ -125,7 +125,7 @@ static int calc_period_shift(void)
> > >     else
> > >             dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > >                             100;
> > > -   return 2 + ilog2(dirty_total - 1);
> > > +   return ilog2(dirty_total - 1) - 1;
> > >  }
> > >  
> > >  /*
> > 
> > You could actually improve upon this now that you have per-bdi bandwidth
> > estimations, simply set the period to (seconds * bandwidth) to get
> > convergence in @seconds.
> 
> I'd like to, but there is the global vs. bdi discrepancy to be
> addressed first :)
> 
> How about doing this simple fix first, and then revisit doing per-bdi
> vm_dirties after the bandwidth estimation goes upstream? 

Sure

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low
  2010-11-24 12:46       ` Peter Zijlstra
@ 2010-11-24 12:59         ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 08:46:51PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 20:30 +0800, Wu Fengguang wrote:
> > 
> > For the 1-dd case, it looks better to lower the break threshold to
> > 125ms. After all, it's not easy for the dirty pages to drop by 250ms
> > worth of data when you only slept 200ms (note: the max pause time has
> > been doubled mainly for servers).
> > 
> > -               if (nr_dirty < dirty_thresh &&
> > -                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 4)
> > +               if (nr_dirty <= dirty_thresh &&
> > +                   bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
> >                         break;
> 
> Hrm, but 125ms worth in 200ms is rather easy, you'd want to keep that
> limit above what the pause should give you, right?

Yeah, I exactly meant to quit the loop after sleeping 200ms.
A 200ms pause already seems large..

I'll leave the safeguard to the "nr_dirty <= dirty_thresh" test :)

This trace shows the problem it tries to solve. Here on fresh boot,
bdi_dirty=106600 which is much larger than bdi_limit=13525.

The same situation may happen if some task quickly eats lots of
anonymous memory: the global/bdi limits will then be knocked down
suddenly and it takes time to bring down bdi_dirty.

#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-2722  [005]    39.623216: balance_dirty_pages: bdi=8:0 bdi_dirty=106600 bdi_limit=13525 task_limit=13423 task_weight=12% task_gap=-11022% bdi_gap=-5504% dirtied=268 bw=0 ratelimit=0 period=0 think=0 pause=200
           <...>-2720  [004]    39.623222: balance_dirty_pages: bdi=8:0 bdi_dirty=106600 bdi_limit=13513 task_limit=13441 task_weight=9% task_gap=-11029% bdi_gap=-5510% dirtied=267 bw=0 ratelimit=0 period=0 think=0 pause=200                             
           <...>-2717  [003]    39.623666: balance_dirty_pages: bdi=8:0 bdi_dirty=106560 bdi_limit=13742 task_limit=13659 task_weight=10% task_gap=-10815% bdi_gap=-5402% dirtied=268 bw=0 ratelimit=0 period=0 think=0 pause=200                            
           <...>-2718  [004]    39.623990: balance_dirty_pages: bdi=8:0 bdi_dirty=106560 bdi_limit=13979 task_limit=13913 task_weight=8% task_gap=-10603% bdi_gap=-5297% dirtied=272 bw=0 ratelimit=0 period=0 think=0 pause=200                             
           <...>-2715  [007]    39.626472: balance_dirty_pages: bdi=8:0 bdi_dirty=106360 bdi_limit=15312 task_limit=15211 task_weight=11% task_gap=-9523% bdi_gap=-4756% dirtied=268 bw=0 ratelimit=0 period=0 think=0 pause=200                             
           <...>-2723  [001]    39.628951: balance_dirty_pages: bdi=8:0 bdi_dirty=106560 bdi_limit=15749 task_limit=15640 task_weight=11% task_gap=-9236% bdi_gap=-4612% dirtied=269 bw=0 ratelimit=0 period=0 think=0 pause=200                  

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 12:50       ` Peter Zijlstra
@ 2010-11-24 13:14         ` Wu Fengguang
  2010-11-24 13:20           ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 08:50:47PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > > +       /*
> > > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > > +        * will each try to update the bandwidth for the same period, making
> > > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > > +        * single dirtier case). So do some rate limiting.
> > > > +        */
> > > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > +               goto snapshot;
> > >
> > > Why this goto snapshot and not simply return? This is the second call
> > > (bdi_update_bandwidth equivalent).
> > 
> > Good question. The loop inside balance_dirty_pages() normally run only
> > once, however wb_writeback() may loop over and over again. If we just
> > return here, the condition
> > 
> >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > 
> > cannot be reset, then future bdi_update_bandwidth() calls in the same
> > wb_writeback() loop will never find it OK to update the bandwidth.
> 
> But the thing is, you don't want to reset that, it might loop so fast
> you'll throttle all of them, if you keep the pre-throttle value you'll
> eventually pass, no?

It (let's name it A) only resets the _local_ vars bw_* when it's sure
by the condition

        (jiffies - bdi->write_bandwidth_update_time < elapsed)

that someone else (name B) has updated the _global_ bandwidth in the
time range we planned. So there may be some time in A's range that is
not covered by B, but sure the range is not totally bypassed without
updating the bandwidth.

> > It does assume no races between CPUs.. We may need some per-cpu based
> > estimation. 
> 
> But that multi-writer race is valid even for the balance_dirty_pages()
> call, two or more could interleave on the bw_time and bw_written
> variables.

The race will only exist in each task's local vars (their bw_* will
overlap). But the update bdi->write_bandwidth* will be safeguarded
by the above check. When the task is scheduled back, it may find
updated write_bandwidth_update_time and hence give up his estimation.
This is rather tricky..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:14         ` Wu Fengguang
@ 2010-11-24 13:20           ` Wu Fengguang
  2010-11-24 13:42             ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 09:14:37PM +0800, Wu Fengguang wrote:
> On Wed, Nov 24, 2010 at 08:50:47PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > > > +       /*
> > > > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > > > +        * will each try to update the bandwidth for the same period, making
> > > > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > > > +        * single dirtier case). So do some rate limiting.
> > > > > +        */
> > > > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > > +               goto snapshot;
> > > >
> > > > Why this goto snapshot and not simply return? This is the second call
> > > > (bdi_update_bandwidth equivalent).
> > > 
> > > Good question. The loop inside balance_dirty_pages() normally run only
> > > once, however wb_writeback() may loop over and over again. If we just
> > > return here, the condition
> > > 
> > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > 
> > > cannot be reset, then future bdi_update_bandwidth() calls in the same
> > > wb_writeback() loop will never find it OK to update the bandwidth.
> > 
> > But the thing is, you don't want to reset that, it might loop so fast
> > you'll throttle all of them, if you keep the pre-throttle value you'll
> > eventually pass, no?
> 
> It (let's name it A) only resets the _local_ vars bw_* when it's sure
> by the condition
> 
>         (jiffies - bdi->write_bandwidth_update_time < elapsed)

this will be true if someone else has _done_ overlapped estimation,
otherwise it will equal:

        jiffies - bdi->write_bandwidth_update_time == elapsed

Sorry the comment needs updating.

Thanks,
Fengguang

> that someone else (name B) has updated the _global_ bandwidth in the
> time range we planned. So there may be some time in A's range that is
> not covered by B, but sure the range is not totally bypassed without
> updating the bandwidth.
> 
> > > It does assume no races between CPUs.. We may need some per-cpu based
> > > estimation. 
> > 
> > But that multi-writer race is valid even for the balance_dirty_pages()
> > call, two or more could interleave on the bw_time and bw_written
> > variables.
> 
> The race will only exist in each task's local vars (their bw_* will
> overlap). But the update bdi->write_bandwidth* will be safeguarded
> by the above check. When the task is scheduled back, it may find
> updated write_bandwidth_update_time and hence give up his estimation.
> This is rather tricky..
> 
> Thanks,
> Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:20           ` Wu Fengguang
@ 2010-11-24 13:42             ` Peter Zijlstra
  2010-11-24 13:46               ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 13:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> 
> this will be true if someone else has _done_ overlapped estimation,
> otherwise it will equal:
> 
>         jiffies - bdi->write_bandwidth_update_time == elapsed
> 
> Sorry the comment needs updating. 

Right, but its racy as hell..

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:42             ` Peter Zijlstra
@ 2010-11-24 13:46               ` Wu Fengguang
  2010-11-24 14:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > 
> > this will be true if someone else has _done_ overlapped estimation,
> > otherwise it will equal:
> > 
> >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > 
> > Sorry the comment needs updating. 
> 
> Right, but its racy as hell..

Yeah, for N concurrent dirtiers, plus the background flusher, only
one is able to update write_bandwidth[_update_time]..


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 10:58   ` Peter Zijlstra
@ 2010-11-24 14:06     ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 06:58:22PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> 
> > @@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
> >  		pause = clamp_val(pause, 1, HZ/10);
> >  
> >  pause:
> > +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
> >  		__set_current_state(TASK_INTERRUPTIBLE);
> >  		io_schedule_timeout(pause);
> > +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
> >  
> >  		/*
> >  		 * The bdi thresh is somehow "soft" limit derived from the
> 
> So its really a two part bandwidth calculation, the first call is:
> 
>   bdi_get_bandwidth()
> 
> and the second call is:
> 
>   bdi_update_bandwidth()
> 
> Would it make sense to actually implement it with two functions instead
> of overloading the functionality of the one function?

Thanks, it's a good suggestion indeed. However, after looking around, I
find it hard to split it up cleanly.. To make it clearer, how about this
comment update?

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2010-11-24 19:05:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-24 22:01:43.000000000 +0800
@@ -554,6 +554,14 @@ out:
 	return a;
 }
 
+/*
+ * This can be repeatedly called inside a long run loop, eg. by wb_writeback().
+ *
+ * On first invocation it will find *bw_written=0 and take the initial snapshot.
+ * On follow up calls it will update the bandwidth if
+ * - at least 10ms data have been collected
+ * - the bandwidth for the time range has not been updated in parallel by others
+ */
 void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				unsigned long *bw_time,
 				s64 *bw_written)
@@ -575,9 +583,12 @@ void bdi_update_write_bandwidth(struct b
 	 * When there lots of tasks throttled in balance_dirty_pages(), they
 	 * will each try to update the bandwidth for the same period, making
 	 * the bandwidth drift much faster than the desired rate (as in the
-	 * single dirtier case). So do some rate limiting.
+	 * single dirtier case).
+	 *
+	 * If someone changed bdi->write_bandwidth_update_time, he has done
+	 * overlapped estimation with us. So start the next round of estimation.
 	 */
-	if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+	if (jiffies - bdi->write_bandwidth_update_time != elapsed)
 		goto snapshot;
 
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:46               ` Wu Fengguang
@ 2010-11-24 14:12                 ` Peter Zijlstra
  2010-11-24 14:21                   ` Wu Fengguang
  2010-11-24 14:34                   ` Wu Fengguang
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 14:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 21:46 +0800, Wu Fengguang wrote:
> On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > 
> > > this will be true if someone else has _done_ overlapped estimation,
> > > otherwise it will equal:
> > > 
> > >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > > 
> > > Sorry the comment needs updating. 
> > 
> > Right, but its racy as hell..
> 
> Yeah, for N concurrent dirtiers, plus the background flusher, only
> one is able to update write_bandwidth[_update_time]..

Wrong, nr_cpus are, they could all observe the old value before seeing
the update of the variable.

Why not something like the below, which keeps the stamps per bdi and
serializes on a lock (trylock, you only need a single updater at any one
time anyway):

probably got the math wrong, but the idea should be clear, you can even
add an explicit bdi_update_bandwidth_stamps() function which resets the
stamps to the current situation in order to skip periods of low
throughput (that would need to do spin_lock).

---
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..de690c3 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -88,6 +89,11 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t bw_lock;
+	unsigned long bw_time_stamp;
+	unsigned long bw_write_stamp;
+	int write_bandwidth;
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..a934fe9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -661,6 +661,11 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
+	spin_lock_init(&bdi->bw_lock);
+	bdi->bw_time_stamp = jiffies;
+	bdi->bw_write_stamp = 0;
+	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
+
 	if (err) {
 err:
 		while (i--)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b840afa..f3f5c24 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
  */
 static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
 {
+	__inc_bdi_state(bdi, BDI_WRITTEN);
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
 }
@@ -238,6 +239,35 @@ void task_dirty_inc(struct task_struct *tsk)
 	prop_inc_single(&vm_dirties, &tsk->dirties);
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
+{
+	unsigned long time_now, write_now;
+	long time_delta, write_delta;
+	long bw;
+
+	if (!spin_try_lock(&bdi->bw_lock))
+		return;
+
+	write_now = bdi_stat(bdi, BDI_WRITTEN);
+	time_now = jiffies;
+
+	write_delta = write_now - bdi->bw_write_stamp;
+	time_delta = time_now - bdi->bw_time_stamp;
+
+	/* rate-limit, only update once every 100ms */
+	if (time_delta < HZ/10 || !write_delta)
+		goto unlock;
+
+	bdi->bw_write_stamp = write_now;
+	bdi->bw_time_stamp = time_now;
+
+	bw = write_delta * HZ / time_delta;
+	bdi->write_bandwidth = (bdi->write_bandwidth + bw + 3) / 4;
+
+unlock:
+	spin_unlock(&bdi->bw_lock);
+}
+
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:12                 ` Peter Zijlstra
@ 2010-11-24 14:21                   ` Wu Fengguang
  2010-11-24 14:31                     ` Peter Zijlstra
  2010-11-24 14:34                   ` Wu Fengguang
  1 sibling, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 10:12:33PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 21:46 +0800, Wu Fengguang wrote:
> > On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > > > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > 
> > > > this will be true if someone else has _done_ overlapped estimation,
> > > > otherwise it will equal:
> > > > 
> > > >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > > > 
> > > > Sorry the comment needs updating. 
> > > 
> > > Right, but its racy as hell..
> > 
> > Yeah, for N concurrent dirtiers, plus the background flusher, only
> > one is able to update write_bandwidth[_update_time]..
> 
> Wrong, nr_cpus are, they could all observe the old value before seeing
> the update of the variable.

Yes, that's what I meant by doing it "per-cpu" in the previous email.

> Why not something like the below, which keeps the stamps per bdi and
> serializes on a lock (trylock, you only need a single updater at any one
> time anyway):

Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
each CPU will see slightly different bandwidth, but that should be
close enough and not a big problem.

> probably got the math wrong, but the idea should be clear, you can even
> add an explicit bdi_update_bandwidth_stamps() function which resets the
> stamps to the current situation in order to skip periods of low
> throughput (that would need to do spin_lock).
> 
> ---
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 4ce34fa..de690c3 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
>  enum bdi_stat_item {
>  	BDI_RECLAIMABLE,
>  	BDI_WRITEBACK,
> +	BDI_WRITTEN,
>  	NR_BDI_STAT_ITEMS
>  };
>  
> @@ -88,6 +89,11 @@ struct backing_dev_info {
>  
>  	struct timer_list laptop_mode_wb_timer;
>  
> +	spinlock_t bw_lock;
> +	unsigned long bw_time_stamp;
> +	unsigned long bw_write_stamp;
> +	int write_bandwidth;
> +
>  #ifdef CONFIG_DEBUG_FS
>  	struct dentry *debug_dir;
>  	struct dentry *debug_stats;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 027100d..a934fe9 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -661,6 +661,11 @@ int bdi_init(struct backing_dev_info *bdi)
>  	bdi->dirty_exceeded = 0;
>  	err = prop_local_init_percpu(&bdi->completions);
>  
> +	spin_lock_init(&bdi->bw_lock);
> +	bdi->bw_time_stamp = jiffies;
> +	bdi->bw_write_stamp = 0;
> +	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
> +
>  	if (err) {
>  err:
>  		while (i--)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b840afa..f3f5c24 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
>   */
>  static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
>  {
> +	__inc_bdi_state(bdi, BDI_WRITTEN);
>  	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
>  			      bdi->max_prop_frac);
>  }
> @@ -238,6 +239,35 @@ void task_dirty_inc(struct task_struct *tsk)
>  	prop_inc_single(&vm_dirties, &tsk->dirties);
>  }
>  
> +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> +{
> +	unsigned long time_now, write_now;
> +	long time_delta, write_delta;
> +	long bw;
> +
> +	if (!spin_try_lock(&bdi->bw_lock))
> +		return;

spin_try_lock is good, however it is still global state and risks
cacheline bouncing..

> +	write_now = bdi_stat(bdi, BDI_WRITTEN);
> +	time_now = jiffies;
> +
> +	write_delta = write_now - bdi->bw_write_stamp;
> +	time_delta = time_now - bdi->bw_time_stamp;
> +
> +	/* rate-limit, only update once every 100ms */
> +	if (time_delta < HZ/10 || !write_delta)
> +		goto unlock;
> +
> +	bdi->bw_write_stamp = write_now;
> +	bdi->bw_time_stamp = time_now;
> +
> +	bw = write_delta * HZ / time_delta;
> +	bdi->write_bandwidth = (bdi->write_bandwidth + bw + 3) / 4;
> +
> +unlock:
> +	spin_unlock(&bdi->bw_lock);
> +}
> +
>  /*
>   * Obtain an accurate fraction of the BDI's portion.
>   */
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:21                   ` Wu Fengguang
@ 2010-11-24 14:31                     ` Peter Zijlstra
  2010-11-24 14:38                       ` Wu Fengguang
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2010-11-24 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-24 at 22:21 +0800, Wu Fengguang wrote:
> 
> Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
> each CPU will see slightly different bandwidth, but that should be
> close enough and not a big problem.

I don't think so; on a large enough machine some cpus might hardly ever
use a particular BDI and hence get very stale data.

Also, it increases the memory footprint of the whole solution.
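
For reference, here is a rough sketch of the per-cpu variant being
discussed (not from either posting; the bw_est field and the helper
name are made up), just to make the trade-off concrete.  It reuses the
BDI_WRITTEN counter added above:

struct bdi_bw_est {
	unsigned long time_stamp;	/* jiffies at the last update */
	unsigned long write_stamp;	/* BDI_WRITTEN at the last update */
	int write_bandwidth;		/* smoothed pages per second */
};

/*
 * Hypothetically hangs off struct backing_dev_info as
 *	struct bdi_bw_est __percpu *bw_est;
 * allocated with alloc_percpu() in bdi_init().
 */
static void bdi_update_write_bandwidth_percpu(struct backing_dev_info *bdi)
{
	struct bdi_bw_est *est = get_cpu_ptr(bdi->bw_est);
	unsigned long written = bdi_stat(bdi, BDI_WRITTEN);
	long time_delta = jiffies - est->time_stamp;
	long write_delta = written - est->write_stamp;

	/* no lock, no shared cacheline: this CPU's estimate is private */
	if (time_delta >= HZ / 10 && write_delta) {
		long bw = write_delta * HZ / time_delta;

		est->time_stamp = jiffies;
		est->write_stamp = written;
		/* 3 parts old estimate, 1 part new sample */
		est->write_bandwidth = (3 * est->write_bandwidth + bw) / 4;
	}
	put_cpu_ptr(bdi->bw_est);
}

No lock at all, but it pays with per-cpu memory for every bdi, and CPUs
that rarely dirty pages against this bdi keep a stale estimate.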

> > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> > +{
> > +     unsigned long time_now, write_now;
> > +     long time_delta, write_delta;
> > +     long bw;
> > +
> > +     if (!spin_trylock(&bdi->bw_lock))
> > +             return;
> 
> spin_trylock is good; however, it is still global state and risks
> cacheline bouncing...

If there are many concurrent writers to the BDI I don't think this is
going to be the top sore spot; once it is, we can think of something
else.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:12                 ` Peter Zijlstra
  2010-11-24 14:21                   ` Wu Fengguang
@ 2010-11-24 14:34                   ` Wu Fengguang
  1 sibling, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

btw, I noticed that the same dd task keeps running on different CPUs.

They are multi-core CPUs on the same socket, so it should not be a big
problem. I'll test a NUMA setup later.

This trace is for the single dd case.

dd-2893  [005]  3535.182111: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.230125: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.274805: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.289337: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.497863: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.585510: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.622001: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.645003: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.659603: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.669497: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.731294: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.785879: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.827724: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.853108: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.875667: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.895731: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.914920: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.925765: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.935545: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3536.015888: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3536.085571: balance_dirty_pages: bdi=8:0

Here is another run (1-dd):

dd-3014  [001]  1248.388001: balance_dirty_pages: bdi=8:0 bdi_dirty=122640 avg_dirty=122643 
dd-3014  [002]  1248.465999: balance_dirty_pages: bdi=8:0 bdi_dirty=122400 avg_dirty=122625 
dd-3014  [007]  1248.531451: balance_dirty_pages: bdi=8:0 bdi_dirty=122240 avg_dirty=122600 
dd-3014  [007]  1248.599260: balance_dirty_pages: bdi=8:0 bdi_dirty=122000 avg_dirty=122561 
dd-3014  [003]  1248.667152: balance_dirty_pages: bdi=8:0 bdi_dirty=120840 avg_dirty=122447 
dd-3014  [007]  1248.734848: balance_dirty_pages: bdi=8:0 bdi_dirty=120640 avg_dirty=122328 
dd-3014  [007]  1248.798652: balance_dirty_pages: bdi=8:0 bdi_dirty=120440 avg_dirty=122210 
dd-3014  [007]  1248.862456: balance_dirty_pages: bdi=8:0 bdi_dirty=120240 avg_dirty=122087 
dd-3014  [003]  1248.926280: balance_dirty_pages: bdi=8:0 bdi_dirty=120040 avg_dirty=121960 
dd-3014  [005]  1248.986091: balance_dirty_pages: bdi=8:0 bdi_dirty=119760 avg_dirty=121832 
dd-3014  [005]  1249.045841: balance_dirty_pages: bdi=8:0 bdi_dirty=119600 avg_dirty=121702 

And this 100-dd case looks much better (though that's not always the case):

dd-2775  [007]   454.844907: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=409468930 
dd-2775  [007]   455.048108: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=409676055 
dd-2775  [007]   455.252347: balance_dirty_pages: bdi=8:0 bdi_dirty=103720 avg_dirty=409676055 
dd-2775  [007]   455.455538: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=409676055 
dd-2775  [007]   455.658597: balance_dirty_pages: bdi=8:0 bdi_dirty=103880 avg_dirty=409676055 
dd-2775  [007]   455.862631: balance_dirty_pages: bdi=8:0 bdi_dirty=103760 avg_dirty=410276387 
dd-2775  [007]   456.068149: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=410276387 
dd-2775  [007]   456.266729: balance_dirty_pages: bdi=8:0 bdi_dirty=103760 avg_dirty=410833633 
dd-2775  [007]   456.470184: balance_dirty_pages: bdi=8:0 bdi_dirty=103840 avg_dirty=410833633 
dd-2775  [007]   456.668919: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=410833633 
dd-2775  [007]   456.872522: balance_dirty_pages: bdi=8:0 bdi_dirty=103880 avg_dirty=411001977 
dd-2775  [007]   457.070753: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=411001977 
dd-2775  [007]   457.275970: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=411227747 
dd-2775  [007]   457.473736: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=411534633 
dd-2775  [007]   457.677204: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=412195106 
dd-2775  [007]   457.881156: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=412195106 
dd-2775  [007]   458.085173: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=412295260 
dd-2775  [007]   458.285146: balance_dirty_pages: bdi=8:0 bdi_dirty=104000 avg_dirty=412295260 
dd-2775  [007]   458.490109: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=412789184 
dd-2775  [007]   458.692230: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=412789184 

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:31                     ` Peter Zijlstra
@ 2010-11-24 14:38                       ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, Nov 24, 2010 at 10:31:57PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 22:21 +0800, Wu Fengguang wrote:
> > 
> > Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
> > each CPU will see slightly different bandwidth, but that should be
> > close enough and not a big problem.
> 
> I don't think so; on a large enough machine some cpus might hardly ever
> use a particular BDI and hence get very stale data.

Good point!

> Also, it increases the memory footprint of the whole solution.

Yeah, maybe not a good trade-off.

> > > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> > > +{
> > > +     unsigned long time_now, write_now;
> > > +     long time_delta, write_delta;
> > > +     long bw;
> > > +
> > > +     if (!spin_trylock(&bdi->bw_lock))
> > > +             return;
> > 
> > spin_trylock is good; however, it is still global state and risks
> > cacheline bouncing...
> 
> If there are many concurrent writers to the BDI I don't think this is
> going to be the top sore spot; once it is, we can think of something
> else.

When there are lots of concurrent writers, we'll target a ~100ms pause
time, so the update frequency will be lowered accordingly.
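
Rough numbers (back-of-the-envelope, not measured): at ~100ms pause per
balance_dirty_pages() trip, each task reaches the trylock at most ~10
times a second, and the 100ms rate-limit inside the update caps the
successful updates at ~10 per second per bdi, no matter how many
dirtiers there are.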

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
       [not found]               ` <20101201133818.GA13377@localhost>
@ 2010-12-01 23:03                 ` Andrew Morton
  2010-12-02  1:56                   ` Wu Fengguang
  2010-12-05 16:14                 ` Wu Fengguang
  1 sibling, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2010-12-01 23:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, Theodore Ts'o, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML

On Wed, 1 Dec 2010 21:38:18 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> It shows that
> 
> 1) io_schedule_timeout(200ms) always returns immediately for iostat,
>    forming a busy loop.  How can this happen? Did iostat receive some
>    signal? Then we may have to break out of the loop on catching
>    signals. Note that I already have
>                 if (fatal_signal_pending(current))
>                         break;
>    in the balance_dirty_pages() loop. Obviously that's not enough.

Presumably the calling task has signal_pending().

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.
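
To illustrate, a minimal sketch of the loop in question (not the actual
code, just the shape of the problem):

	for (;;) {
		/* ... compute `pause' from the dirty numbers ... */
		__set_current_state(TASK_INTERRUPTIBLE);
		io_schedule_timeout(pause);

		if (fatal_signal_pending(current))	/* SIGKILL only */
			break;
		/*
		 * Any other pending signal makes io_schedule_timeout()
		 * return immediately in TASK_INTERRUPTIBLE, so until the
		 * dirty numbers drop below the limit this is a busywait.
		 * TASK_UNINTERRUPTIBLE sleeps for the full `pause', shows
		 * the task in D state and counts it into load average.
		 */
		/* ... break out once below the dirty limit ... */
	}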

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-01 23:03                 ` Andrew Morton
@ 2010-12-02  1:56                   ` Wu Fengguang
  0 siblings, 0 replies; 71+ messages in thread
From: Wu Fengguang @ 2010-12-02  1:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Theodore Ts'o, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML

On Thu, Dec 02, 2010 at 07:03:33AM +0800, Andrew Morton wrote:
> On Wed, 1 Dec 2010 21:38:18 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > It shows that
> > 
> > 1) io_schedule_timeout(200ms) always returns immediately for iostat,
> >    forming a busy loop.  How can this happen? Did iostat receive some
> >    signal? Then we may have to break out of the loop on catching
> >    signals. Note that I already have
> >                 if (fatal_signal_pending(current))
> >                         break;
> >    in the balance_dirty_pages() loop. Obviously that's not enough.
> 
> Presumably the calling task has signal_pending().
> 
> Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
> going to do that then it must break out if signal_pending(), otherwise
> it's pretty much guaranteed to degenerate into a busywait loop.

Right. It doesn't seem rewarding enough to check signal_pending().  We
can already respond to signals much faster than before (when the task
could block for much longer in get_request_wait()).

> Plus we *do* want these processes to appear in D state and to
> contribute to load average.
> 
> So it should be TASK_UNINTERRUPTIBLE.

Fair enough. I did miss the D state (without the long wait :).
Here is the patch.

Thanks,
Fengguang
---
Subject: writeback: do uninterruptible sleep in balance_dirty_pages()
Date: Thu Dec 02 09:31:19 CST 2010

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.                 -- Andrew Morton

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-02 09:30:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-02 09:30:34.000000000 +0800
@@ -636,7 +636,7 @@ pause:
 					  pages_dirtied,
 					  pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
       [not found]               ` <20101201133818.GA13377@localhost>
  2010-12-01 23:03                 ` Andrew Morton
@ 2010-12-05 16:14                 ` Wu Fengguang
  2010-12-06  2:42                   ` Ted Ts'o
  1 sibling, 1 reply; 71+ messages in thread
From: Wu Fengguang @ 2010-12-05 16:14 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

[-- Attachment #1: Type: text/plain, Size: 3733 bytes --]

On Wed, Dec 01, 2010 at 09:38:18PM +0800, Wu Fengguang wrote:
> [restore CC list for new findings]
> 
> On Wed, Dec 01, 2010 at 06:39:25AM +0800, Peter Zijlstra wrote:
> > On Tue, 2010-11-30 at 23:35 +0100, Peter Zijlstra wrote:
> > > On Tue, 2010-11-30 at 12:37 +0800, Wu Fengguang wrote:
> > > > On Tue, Nov 30, 2010 at 04:53:33AM +0800, Peter Zijlstra wrote:
> > > > > On Mon, 2010-11-29 at 23:17 +0800, Wu Fengguang wrote:
> > > > > > Hi Peter,
> > > > > >
> > > > > > I'm drawing funny graphs to track the writeback dynamics :)
> > > > > >
> > > > > > In the attached graphs, I see anomalies in dirty-pages-3000.png and
> > > > > > dirty-pages-200.png.  The task limit is what's returned by
> > > > > > task_dirty_limit(), which should be very stable. However, from the
> > > > > > graph it seems the task weight (numerator/denominator) suddenly
> > > > > > drops to near 0 every 9-10 seconds.  Do you have immediate insight
> > > > > > into what's going on? If not, I'm going to do some tracing to track down
> > > > > > how the numbers change over time.
> > > > >
> > > > > No immediate thoughts there.. I need to look through the math again, but
> > > > > I'm kinda swamped atm. (and my primary dev machine had its disk die this
> > > > > morning). I'll try and get around to it soon..
> > > >
> > > > Peter, I did a simple debug patch (attached) and collected these
> > > > numbers. I noticed that at the "task_weight=27%" and "task_weight=14%"
> > > > lines, "period" increases and "num" decreases while "den" is still
> > > > increasing.
> > > >
> > > > num=db2e den=e8c0 period=3f8000 shift=10
> > > > num=e04c den=ede0 period=3f8000 shift=10
> > > > num=e56a den=f300 period=3f8000 shift=10
> > >
> > > > num=3e78 den=e400 period=408000 shift=10
> > >
> > > > num=1341 den=8900 period=418000 shift=10
> > > > num=185f den=8e20 period=418000 shift=10
> > > > num=1d7d den=9340 period=418000 shift=10
> > > > num=229b den=9860 period=418000 shift=10
> > > > num=27b9 den=9da0 period=418000 shift=10
> > > > num=2cd7 den=a2c0 period=418000 shift=10
> > >
> > >
> > > This looks sane.. the period indicates someone else was dirtying lots of
> > > pages. Every time the period increases (it's shifted right by shift) we
> > > divide the events (num) by 2.
> >
> > It's actually shifted left by shift-1.. see prop_norm_single(), which
> > would make the below:
> >
> > > So the increment from 3f8000 to 408000 is 4064 to 4128, or 64; that
> > > should reset events to 0, and seeing that it didn't means it got incremented
> > > as well.
> > >
> > > Funny enough, the second jump is again exactly 64..
> > >
> > > Anyway, as you can see, den increases as long as period stays constant,
> > > and takes a dip when period increments.
> >
> > two steps of 128, which is terribly large.
> >
> > then again, a period of 512 pages is very very small.
> 
> Peter, I also collected prop_norm_single() traces, hope it helps.
> 
> Again, you can find the time points where the task limit suddenly jumps
> high in the "dirty-pages*.png" graphs, and then find the corresponding
> data points in the "trace" file. Sorry, I computed something wrong: the
> "ratio" field in the trace data is always 0, so please just ignore it.
> 
> I noticed that jbd2/sda8-8-2811 dirtied lots of pages, perhaps by
> ext4_bio_write_page(). This should happen only on -ENOMEM.  I also

Ah, I seem to have found the root cause. See the attached graphs. Ext4
appears to be calling redirty_page_for_writepage() to redirty ~300MB of
pages every ~10s. The redirties happen in big bursts, so not
surprisingly the dd task's dirty weight suddenly drops to 0.

It should be the same ext4 issue discussed here:

        http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Thanks,
Fengguang

[-- Attachment #2: vmstat-written-300.png --]
[-- Type: image/png, Size: 44152 bytes --]

[-- Attachment #3: vmstat-written.png --]
[-- Type: image/png, Size: 40715 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-05 16:14                 ` Wu Fengguang
@ 2010-12-06  2:42                   ` Ted Ts'o
  2010-12-06  9:52                     ` Dmitry
  0 siblings, 1 reply; 71+ messages in thread
From: Ted Ts'o @ 2010-12-06  2:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> 
> Ah, I seem to have found the root cause. See the attached graphs. Ext4
> appears to be calling redirty_page_for_writepage() to redirty ~300MB of
> pages every ~10s. The redirties happen in big bursts, so not
> surprisingly the dd task's dirty weight suddenly drops to 0.
> 
> It should be the same ext4 issue discussed here:
> 
>         http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Yeah, unfortunately the fix suggested isn't the right one.

The right fix is going to involve making much more radical changes to
the ext4 write submission path, which is on my todo queue.  For now, if
people don't like these nasty writeback dynamics, my suggestion is to
mount the filesystem with data=writeback.

This is basically the clean equivalent of the patch suggested by Feng
Tang in his e-mail referenced above.  Given that ext4 uses delayed
allocation, most of the time unwritten blocks are not allocated, and
so stale data isn't exposed.

The case which you're seeing here is where the jbd2 data=ordered
forced writeback collides with the writeback thread, and
unfortunately, the forced writeback in the jbd2 layer is done in an
extremely inefficient manner.  So data=writeback is the workaround,
and unlike ext3, it's not a serious security leak.  It is possible for
some stale data to get exposed if you get unlucky when you crash,
though, so there is a potential for some security exposure.

The long-term solution to this problem is to rework the ext4 writeback
path so that we write the data blocks when they are newly allocated,
and then only update fs metadata once they are written.  As I said,
it's on my queue.  Until then, the only suggestion I can give folks is
data=writeback.

						- Ted

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-06  2:42                   ` Ted Ts'o
@ 2010-12-06  9:52                     ` Dmitry
  2010-12-06 12:34                       ` Ted Ts'o
  0 siblings, 1 reply; 71+ messages in thread
From: Dmitry @ 2010-12-06  9:52 UTC (permalink / raw)
  To: Ted Ts'o, Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

On Sun, 5 Dec 2010 21:42:31 -0500, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> > 
> > Ah, I seem to have found the root cause. See the attached graphs. Ext4
> > appears to be calling redirty_page_for_writepage() to redirty ~300MB of
> > pages every ~10s. The redirties happen in big bursts, so not
> > surprisingly the dd task's dirty weight suddenly drops to 0.
> > 
> > It should be the same ext4 issue discussed here:
> > 
> >         http://www.spinics.net/lists/linux-fsdevel/msg39555.html
> 
> Yeah, unfortunately the fix suggested isn't the right one.
> 
> The right fix is going to involve making much more radical changes to
> the ext4 write submission path, which is on my todo queue.  For now, if
> people don't like these nasty writeback dynamics, my suggestion is to
> mount the filesystem with data=writeback.
> 
> This is basically the clean equivalent of the patch suggested by Feng
> Tang in his e-mail referenced above.  Given that ext4 uses delayed
> allocation, most of the time unwritten blocks are not allocated, and
> so stale data isn't exposed.
Maybe it is reasonable to introduce a new mount option which controls
the dynamic delalloc on/off behavior, for example like this:
0) -odelalloc=off    : analog of nodelalloc
1) -odelalloc=normal : Default mode (disable delalloc if close to a full fs)
2) -odelalloc=force  : delalloc mode always enabled, so we have to do
                       writeback more aggressively in case of ENOSPC.

So one can force delalloc and safely use this writeback mode in a
multi-user environment. OpenVZ already has this. I'll prepare the patch
if you are interested in that feature.
> 
> The case which you're seeing here is where the jbd2 data=ordered
> forced writeback collides with the writeback thread, and
> unfortunately, the forced writeback in the jbd2 layer is done in an
> extremely inefficient manner.  So data=writeback is the workaround,
> and unlike ext3, it's not a serious security leak.  It is possible for
> some stale data to get exposed if you get unlucky when you crash,
> though, so there is a potential for some security exposure.
> 
> The long-term solution to this problem is to rework the ext4 writeback
> path so that we write the data blocks when they are newly allocated,
> and then only update fs metadata once they are written.  As I said,
> it's on my queue.  Until then, the only suggestion I can give folks is
> data=writeback.
> 
> 						- Ted
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-06  9:52                     ` Dmitry
@ 2010-12-06 12:34                       ` Ted Ts'o
  0 siblings, 0 replies; 71+ messages in thread
From: Ted Ts'o @ 2010-12-06 12:34 UTC (permalink / raw)
  To: Dmitry
  Cc: Wu Fengguang, Peter Zijlstra, Andrew Morton, Chris Mason,
	Dave Chinner, Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm, linux-fsdevel,
	LKML, Tang, Feng, linux-ext4

On Mon, Dec 06, 2010 at 12:52:21PM +0300, Dmitry wrote:
> Maybe it is reasonable to introduce a new mount option which controls
> the dynamic delalloc on/off behavior, for example like this:
> 0) -odelalloc=off    : analog of nodelalloc
> 1) -odelalloc=normal : Default mode (disable delalloc if close to a full fs)
> 2) -odelalloc=force  : delalloc mode always enabled, so we have to do
>                        writeback more aggressively in case of ENOSPC.
> 
> So one can force delalloc and safely use this writeback mode in a
> multi-user environment. OpenVZ already has this. I'll prepare the patch
> if you are interested in that feature.

Yeah, I'd really rather not do that.  There are significant downsides
to your proposed delalloc=force mode, one of which is that we could run
out of space and not notice.  If the application doesn't call fsync()
and check the return value, and simply close()'s the file and then
exits, then when the writeback threads do get around to writing the
file, the block allocation could fail, and oops, data gets lost.
There's a _reason_ why we disable delalloc when we're close to a full
fs.  The only alternative is to be super conservative when doing your
block reservation calculations, and in that case you end up returning
ENOSPC far too soon.
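
A minimal sketch of the failure pattern described above (illustrative
user-space code, not from the thread):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = "important data";
	int fd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	write(fd, buf, sizeof(buf));	/* only dirties the page cache */
	close(fd);			/* returns 0; nothing on disk yet */
	/*
	 * Without an fsync(fd) (and a check of its return value) before
	 * close(), a later ENOSPC hit during delayed block allocation at
	 * writeback time is reported to nobody, and the data is silently
	 * lost.
	 */
	return 0;
}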

						- Ted

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2010-12-06 12:36 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
2010-11-17 10:34   ` Minchan Kim
2010-11-22  2:01     ` Wu Fengguang
2010-11-17 23:08   ` Andrew Morton
2010-11-18 13:04   ` Peter Zijlstra
2010-11-18 13:26     ` Wu Fengguang
2010-11-18 13:40       ` Peter Zijlstra
2010-11-18 14:02         ` Wu Fengguang
     [not found]     ` <20101129151719.GA30590@localhost>
     [not found]       ` <1291064013.32004.393.camel@laptop>
     [not found]         ` <20101130043735.GA22947@localhost>
     [not found]           ` <1291156522.32004.1359.camel@laptop>
     [not found]             ` <1291156765.32004.1365.camel@laptop>
     [not found]               ` <20101201133818.GA13377@localhost>
2010-12-01 23:03                 ` Andrew Morton
2010-12-02  1:56                   ` Wu Fengguang
2010-12-05 16:14                 ` Wu Fengguang
2010-12-06  2:42                   ` Ted Ts'o
2010-12-06  9:52                     ` Dmitry
2010-12-06 12:34                       ` Ted Ts'o
2010-11-17  4:27 ` [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2010-11-17  4:27 ` [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
2010-11-17 14:39   ` Wu Fengguang
2010-11-24 10:23   ` Peter Zijlstra
2010-11-24 10:43     ` Wu Fengguang
2010-11-24 10:49       ` Peter Zijlstra
2010-11-17  4:27 ` [PATCH 04/13] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2010-11-17  4:27 ` [PATCH 05/13] writeback: account per-bdi accumulated written pages Wu Fengguang
2010-11-24 10:26   ` Peter Zijlstra
2010-11-24 10:44     ` Wu Fengguang
2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
2010-11-17 23:08   ` Andrew Morton
2010-11-17 23:24     ` Peter Zijlstra
2010-11-17 23:38       ` Andrew Morton
2010-11-17 23:43         ` Peter Zijlstra
2010-11-18  6:51     ` Wu Fengguang
2010-11-24 10:58   ` Peter Zijlstra
2010-11-24 14:06     ` Wu Fengguang
2010-11-24 11:05   ` Peter Zijlstra
2010-11-24 12:10     ` Wu Fengguang
2010-11-24 12:50       ` Peter Zijlstra
2010-11-24 13:14         ` Wu Fengguang
2010-11-24 13:20           ` Wu Fengguang
2010-11-24 13:42             ` Peter Zijlstra
2010-11-24 13:46               ` Wu Fengguang
2010-11-24 14:12                 ` Peter Zijlstra
2010-11-24 14:21                   ` Wu Fengguang
2010-11-24 14:31                     ` Peter Zijlstra
2010-11-24 14:38                       ` Wu Fengguang
2010-11-24 14:34                   ` Wu Fengguang
2010-11-17  4:27 ` [PATCH 07/13] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2010-11-17  4:27 ` [PATCH 08/13] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
2010-11-24 11:13   ` Peter Zijlstra
2010-11-24 12:30     ` Wu Fengguang
2010-11-24 12:46       ` Peter Zijlstra
2010-11-24 12:59         ` Wu Fengguang
2010-11-17  4:27 ` [PATCH 09/13] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2010-11-24 11:15   ` Peter Zijlstra
2010-11-24 12:39     ` Wu Fengguang
2010-11-24 12:56       ` Peter Zijlstra
2010-11-17  4:27 ` [PATCH 10/13] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
2010-11-24 11:18   ` Peter Zijlstra
2010-11-24 12:48     ` Wu Fengguang
2010-11-17  4:27 ` [PATCH 11/13] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
2010-11-17  4:27 ` [PATCH 12/13] writeback: add trace event for balance_dirty_pages() Wu Fengguang
2010-11-17  4:41   ` Wu Fengguang
2010-11-17  4:27 ` [PATCH 13/13] writeback: make nr_to_write a per-file limit Wu Fengguang
2010-11-17 23:03 ` [PATCH 00/13] IO-less dirty throttling v2 Andrew Morton
2010-11-18  2:06   ` Dave Chinner
2010-11-18  2:09     ` Andrew Morton
2010-11-18  3:21       ` Dave Chinner
2010-11-18  3:34         ` Andrew Morton
2010-11-18  7:27           ` Dave Chinner
2010-11-18  7:33             ` Andrew Morton
2010-11-19  3:11               ` Dave Chinner
2010-11-24 11:12       ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).