* [PATCH 00/35] IO-less dirty throttling v4
@ 2010-12-13 14:46 ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	Wu Fengguang, linux-mm, linux-fsdevel, LKML


Andrew,

I'm glad to release this extensively tested v4 IO-less dirty throttling
patchset. It's based on 2.6.37-rc5 and Jan's sync livelock patches.

Given its trickiness and the possibility of side effects, independent
testing is highly welcome. Here is the git tree for easy access:

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v4

Andrew, I followed your suggestion to add some trace points, and went
further to write scripts that run automated tests and visualize the
collected trace, iostat and vmstat data. The help has been tremendous.
The tests and data analysis paved the way for many fixes and algorithm
improvements.

It still took a long time. The most challenging tasks were the
fluctuations with 100+ dd's and on NFS, and various imperfections in the
control system and in many filesystems. I would not have been able to
get this far without the help of the pretty graphs, and I believe
they'll continue to make future maintenance easy. To identify a problem
reported by an end user, just ask for the traces; I'll then turn them
into graphs and quickly get an overview of the problem.

The most up-to-date graphs and the corresponding scripts are uploaded to

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests

Here you may find and compare test results for this patchset
(2.6.37-rc5+) and for the vanilla kernel (2.6.37-rc5). Filesystem
developers may be interested in taking a look at the dynamics.

The control algorithms are generally doing well in the recent graphs.
There are regular fluctuations in the number of dirty pages; however,
they mostly originate from underneath: the lower layers report IO
completion in units of 1MB, 32MB or even more, leading to sudden drops
in the dirty page count.

The tests cover the common scenarios

- ext2, ext3, ext4, xfs, btrfs, nfs
- 256M, 512M, 3G, 16G memory sizes
- single disk and 12-disk array
- 1, 2, 10, 100, 1000 concurrent dd's

They disclose lots of imperfections and bugs in
1) this patchset
2) filesystems not working well with the new paradigm
3) filesystem problems that also exist in the vanilla kernel

I managed to fix case (1), fix most of case (2), and report case (3).
Below are some interesting graphs illustrating the problems.

BTRFS

case (3) problem, nr_dirty going all the way down to 0, fixed by
[PATCH 26/35] btrfs: wait on too many nr_async_bios
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-17/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png
after fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png

case (3) problem, ugly looking but otherwise harmless, not fixed yet
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-14/vmstat-written.png
the root cause is that btrfs always clears the page dirty bit at the end
of prepare_pages() and then sets it dirty again in
dirty_and_release_pages(). This leads to duplicate dirty accounting on
1KB-sized writes.
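
For illustration, here is a minimal user-space sketch of that effect (a
sketch only, not btrfs code; the 4KB page size, 1KB write size and
counter names are assumptions). Clearing the dirty bit between sub-page
writes makes every 1KB write look like a fresh page dirtying:

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE	4096
#define WRITE_SIZE	1024

int main(void)
{
	bool page_dirty = false;
	long nr_dirtied = 0;		/* what the dirty accounting sees */
	long bytes = 16 * PAGE_SIZE;	/* 64KB written in 1KB chunks */

	for (long done = 0; done < bytes; done += WRITE_SIZE) {
		/* prepare_pages(): the page comes out clean */
		page_dirty = false;

		/* dirty_and_release_pages(): set it dirty again */
		if (!page_dirty) {
			page_dirty = true;
			nr_dirtied++;	/* counted as a newly dirtied page */
		}
	}

	printf("pages actually touched: %ld\n", bytes / PAGE_SIZE);	/* 16 */
	printf("pages accounted dirty:  %ld\n", nr_dirtied);		/* 64 */
	return 0;
}

So with 1KB writes, each 4KB page ends up accounted as four newly
dirtied pages instead of one.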

case (3) problem, bdi limit exceeded with 10+ concurrent dd's, fixed by
[PATCH 25/35] btrfs: lower the dirty balancing rate limit
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/dirty-pages.png

case (2) problem, not root caused yet

in the vanilla kernel, the dirty/writeback pages are interesting
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/vmstat-dirty.png

but performance is still excellent
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/iostat-bw.png

with IO-less balance_dirty_pages(), it's much slower
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-bw.png

dirty pages go very low
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/vmstat-dirty.png

with only 20% disk util
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-util.png

EXT4

case (3) problem, possibly a memory leak, not root caused yet
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/ext4-100dd-1M-24p-15976M-2.6.37-rc5+-2010-12-09-23-40/dirty-pages.png

case (3) problem, burst-of-redirty, known issue with data=ordered, would be non-trivial to fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages-3000.png
the current workaround is to mount with data=writeback
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4_wb-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-12-13-40/dirty-pages.png

EXT3

Maybe not a big problem, but I noticed the dd task may get stuck for up
to 500ms, perhaps in write_begin/end(). It shows up as negative pause
time in the graph below, accompanied by a sudden drop of dirty pages.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext3-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-07/dirty-pages-200.png
the writeback pages also drop from time to time
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext3-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-07/vmstat-dirty-300.png
and the average request size may drop from ~1M to ~500K at times
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext3-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-07/iostat-misc.png

NFS

There are some hard problems
- large fluctuations of everything
- writeback/unstable pages squeezing out the dirty pages
- sometimes the dirtiers may stall for 1-2 seconds because no COMMITs
  return during that time; this is hard to fix on the client side

before the patches
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-11-10-31/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-12-40/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/dirty-bandwidth.png

after patches
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/dirty-bandwidth-3000.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty.png

burst of commit submits/returns
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png
after fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

The 1-second stall happens at around 317s and 321s. Fortunately it only
happens with 10+ concurrent dd's, which is not a typical NFS client
workload.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png


XFS

Performs almost ideally, except for some trivial imperfections: here
and there the lines are not straight.

dirty/writeback pages
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-18-18/vmstat-dirty.png

avg queue size and wait time
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/iostat-misc.png

bandwidth
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/dirty-bandwidth.png


Changes from v3 <http://lkml.org/lkml/2010/12/13/69>

- fold patches and reorganize the patchset; each patch passed a compile test
- remove patch "writeback: make reasonable gap between the dirty/background thresholds"

Changes from v2 <http://lkml.org/lkml/2010/11/16/728>

- lock-protected bdi bandwidth estimation
- user space think time compensation
- raise the max pause time to 200ms for lower CPU overheads with concurrent dirtiers
- control system enhancements to handle large pause times and huge numbers of tasks
- concurrent dd test suite and a lot of tests
- adaptively scale up the writeback chunk size
- make it work right on small-memory systems
- various bug fixes
- new trace points

Changes from initial RFC <http://thread.gmane.org/gmane.linux.kernel.mm/52966>

- adaptive rate limiting, to reduce overheads when under the throttle threshold
- prevent overrunning the dirty limit with lots of concurrent dirtiers
- add Documentation/filesystems/writeback-throttling-design.txt
- lower the max pause time from 200ms to 100ms; the min pause time from 10ms to 1 jiffy
- don't drop the laptop mode code
- update and comment the trace event
- benchmarks on concurrent dd and fs_mark covering both large and tiny files
- rate limit bdi->write_bandwidth updates with concurrent dirtiers,
  otherwise it will drift fast and fluctuate
- don't call balance_dirty_pages_ratelimited() when writing to already dirtied
  pages, otherwise the task will be throttled too much

[PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
[PATCH 02/35] writeback: safety margin for bdi stat error
[PATCH 03/35] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
[PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

[PATCH 05/35] writeback: IO-less balance_dirty_pages()
[PATCH 06/35] writeback: consolidate variable names in balance_dirty_pages()
[PATCH 07/35] writeback: per-task rate limit on balance_dirty_pages()
[PATCH 08/35] writeback: user space think time compensation
[PATCH 09/35] writeback: account per-bdi accumulated written pages
[PATCH 10/35] writeback: bdi write bandwidth estimation
[PATCH 11/35] writeback: show bdi write bandwidth in debugfs
[PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers
[PATCH 13/35] writeback: bdi base throttle bandwidth
[PATCH 14/35] writeback: smoothed bdi dirty pages
[PATCH 15/35] writeback: adapt max balance pause time to memory size
[PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
[PATCH 17/35] writeback: quit throttling when bdi dirty pages dropped low
[PATCH 18/35] writeback: start background writeback earlier

[PATCH 19/35] writeback: make nr_to_write a per-file limit
[PATCH 20/35] writeback: scale IO chunk size up to device bandwidth

[PATCH 21/35] writeback: trace balance_dirty_pages()
[PATCH 22/35] writeback: trace global dirty page states
[PATCH 23/35] writeback: trace writeback_single_inode()

[PATCH 24/35] btrfs: don't call balance_dirty_pages_ratelimited() on already dirty pages
[PATCH 25/35] btrfs: lower the dirty balancing rate limit
[PATCH 26/35] btrfs: wait on too many nr_async_bios

[PATCH 27/35] nfs: livelock prevention is now done in VFS
[PATCH 28/35] nfs: writeback pages wait queue
[PATCH 29/35] nfs: in-commit pages accounting and wait queue
[PATCH 30/35] nfs: heuristics to avoid commit
[PATCH 31/35] nfs: don't change wbc->nr_to_write in write_inode()
[PATCH 32/35] nfs: limit the range of commits
[PATCH 33/35] nfs: adapt congestion threshold to dirty threshold
[PATCH 34/35] nfs: trace nfs_commit_unstable_pages()
[PATCH 35/35] nfs: trace nfs_commit_release()

 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++
 fs/btrfs/disk-io.c                                        |    7 
 fs/btrfs/file.c                                           |   16 
 fs/btrfs/ioctl.c                                          |    6 
 fs/btrfs/relocation.c                                     |    6 
 fs/fs-writeback.c                                         |   85 +
 fs/nfs/client.c                                           |    3 
 fs/nfs/file.c                                             |    9 
 fs/nfs/write.c                                            |  241 +++-
 include/linux/backing-dev.h                               |    9 
 include/linux/nfs_fs.h                                    |    1 
 include/linux/nfs_fs_sb.h                                 |    3 
 include/linux/sched.h                                     |    8 
 include/linux/writeback.h                                 |   26 
 include/trace/events/nfs.h                                |   89 +
 include/trace/events/writeback.h                          |  195 +++
 mm/backing-dev.c                                          |   32 
 mm/filemap.c                                              |    5 
 mm/memory_hotplug.c                                       |    3 
 mm/page-writeback.c                                       |  518 +++++++---
 20 files changed, 1212 insertions(+), 260 deletions(-)

Thanks,
Fengguang



* [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Rik van Riel, Peter Zijlstra, Wu Fengguang,
	Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-min-bdi-dirty-limit.patch --]
[-- Type: text/plain, Size: 4291 bytes --]

I noticed that my NFSROOT test system becomes slow to respond when there
is a heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
is near 0 and many tasks in the system are repeatedly stuck in
balance_dirty_pages().

There are two generic problems:

- light dirtiers on one device (more often than not the rootfs) get
  heavily impacted by heavy dirtiers on another, independent device

- the lightly dirtied device gets heavy throttling because its bdi
  limit is 0, and the heavy throttling may in turn hold its bdi limit
  at 0, as it cannot dirty pages fast enough to grow the bdi's
  proportional weight.

Fix it by introducing a "low pass" gate: a small (<=32MB) value that is
reserved by all bdi's and can be safely "stolen" from the current global
dirty margin by a bdi that runs low.  It does not need to be big to help
the bdi gain its initial weight.

Acked-by: Rik van Riel <riel@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    3 ++-
 mm/backing-dev.c          |    2 +-
 mm/page-writeback.c       |   29 ++++++++++++++++++++++++++---
 3 files changed, 29 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:10.000000000 +0800
@@ -443,13 +443,26 @@ void global_dirty_limits(unsigned long *
  *
  * The bdi's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
- */
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+ *
+ * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
+ * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
+ * B always get heavily throttled and bdi B's dirty limit might never be able
+ * to grow up from 0. So we do tricks to reserve some global margin and honour
+ * it to the bdi's that run low.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			      unsigned long dirty,
+			      unsigned long dirty_pages)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;
 
 	/*
+	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
+	 */
+	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
+
+	/*
 	 * Calculate this BDI's share of the dirty ratio.
 	 */
 	bdi_writeout_fraction(bdi, &numerator, &denominator);
@@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
 	do_div(bdi_dirty, denominator);
 
 	bdi_dirty += (dirty * bdi->min_ratio) / 100;
+
+	/*
+	 * If we can dirty N more pages globally, honour N/2 to the bdi that
+	 * runs low, so as to help it ramp up.
+	 */
+	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
+		     dirty > dirty_pages))
+		bdi_dirty = (dirty - dirty_pages) / 2;
+
 	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 		bdi_dirty = dirty * bdi->max_ratio / 100;
 
@@ -508,7 +530,8 @@ static void balance_dirty_pages(struct a
 				(background_thresh + dirty_thresh) / 2)
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
+					     nr_reclaimable + nr_writeback);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:10.000000000 +0800
@@ -83,7 +83,7 @@ static int bdi_debug_stats_show(struct s
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
@@ -126,7 +126,8 @@ int dirty_writeback_centisecs_handler(st
 
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty);
+			       unsigned long dirty,
+			       unsigned long dirty_pages);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
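
For illustration of the ramp-up rule in the hunks above, here is a
minimal user-space sketch (a sketch only, not kernel code; the page
size and all numbers are assumed examples, and bdi->min_ratio/max_ratio
are left out):

#include <stdio.h>

#define PAGE_SHIFT 12

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long dirty = 100000;		/* global dirty threshold, in pages */
	unsigned long dirty_pages = 60000;	/* pages currently dirty */
	unsigned long numerator = 0, denominator = 100;	/* new bdi: ~0 writeout weight */
	unsigned long bdi_dirty;

	/* global safety margin of ~1%, capped at 32MB worth of pages */
	dirty -= min_ul(dirty / 128, 32768UL >> (PAGE_SHIFT - 10));

	/* this bdi's proportional share of the dirty limit */
	bdi_dirty = dirty * numerator / denominator;

	/* ramp-up: grant half of the remaining global headroom */
	if (bdi_dirty < (dirty - dirty_pages) / 2 && dirty > dirty_pages)
		bdi_dirty = (dirty - dirty_pages) / 2;

	printf("bdi dirty limit: %lu pages\n", bdi_dirty);	/* ~19600 */
	return 0;
}

A bdi whose proportional weight is still near 0 is thus granted about
half of the remaining global headroom instead of being throttled against
a limit of 0.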



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
@ 2010-12-13 14:46   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Rik van Riel, Peter Zijlstra, Wu Fengguang,
	Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-min-bdi-dirty-limit.patch --]
[-- Type: text/plain, Size: 4587 bytes --]

I noticed that my NFSROOT test system goes slow responding when there
is heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
is near 0 and many tasks in the system are repeatedly stuck in
balance_dirty_pages().

There are two generic problems:

- light dirtiers at one device (more often than not the rootfs) get
  heavily impacted by heavy dirtiers on another independent device

- the light dirtied device does heavy throttling because bdi limit=0,
  and the heavy throttling may in turn withhold its bdi limit in 0 as
  it cannot dirty fast enough to grow up the bdi's proportional weight.

Fix it by introducing some "low pass" gate, which is a small (<=32MB)
value reserved by others and can be safely "stole" from the current
global dirty margin.  It does not need to be big to help the bdi gain
its initial weight.

Acked-by: Rik van Riel <riel@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    3 ++-
 mm/backing-dev.c          |    2 +-
 mm/page-writeback.c       |   29 ++++++++++++++++++++++++++---
 3 files changed, 29 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:10.000000000 +0800
@@ -443,13 +443,26 @@ void global_dirty_limits(unsigned long *
  *
  * The bdi's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
- */
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+ *
+ * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
+ * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
+ * B always get heavily throttled and bdi B's dirty limit might never be able
+ * to grow up from 0. So we do tricks to reserve some global margin and honour
+ * it to the bdi's that run low.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			      unsigned long dirty,
+			      unsigned long dirty_pages)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;
 
 	/*
+	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
+	 */
+	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
+
+	/*
 	 * Calculate this BDI's share of the dirty ratio.
 	 */
 	bdi_writeout_fraction(bdi, &numerator, &denominator);
@@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
 	do_div(bdi_dirty, denominator);
 
 	bdi_dirty += (dirty * bdi->min_ratio) / 100;
+
+	/*
+	 * If we can dirty N more pages globally, honour N/2 to the bdi that
+	 * runs low, so as to help it ramp up.
+	 */
+	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
+		     dirty > dirty_pages))
+		bdi_dirty = (dirty - dirty_pages) / 2;
+
 	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 		bdi_dirty = dirty * bdi->max_ratio / 100;
 
@@ -508,7 +530,8 @@ static void balance_dirty_pages(struct a
 				(background_thresh + dirty_thresh) / 2)
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
+					     nr_reclaimable + nr_writeback);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:10.000000000 +0800
@@ -83,7 +83,7 @@ static int bdi_debug_stats_show(struct s
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
@@ -126,7 +126,8 @@ int dirty_writeback_centisecs_handler(st
 
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty);
+			       unsigned long dirty,
+			       unsigned long dirty_pages);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
@ 2010-12-13 14:46   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Rik van Riel, Peter Zijlstra, Wu Fengguang,
	Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-min-bdi-dirty-limit.patch --]
[-- Type: text/plain, Size: 4587 bytes --]

I noticed that my NFSROOT test system goes slow responding when there
is heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
is near 0 and many tasks in the system are repeatedly stuck in
balance_dirty_pages().

There are two generic problems:

- light dirtiers at one device (more often than not the rootfs) get
  heavily impacted by heavy dirtiers on another independent device

- the light dirtied device does heavy throttling because bdi limit=0,
  and the heavy throttling may in turn withhold its bdi limit in 0 as
  it cannot dirty fast enough to grow up the bdi's proportional weight.

Fix it by introducing some "low pass" gate, which is a small (<=32MB)
value reserved by others and can be safely "stole" from the current
global dirty margin.  It does not need to be big to help the bdi gain
its initial weight.

Acked-by: Rik van Riel <riel@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    3 ++-
 mm/backing-dev.c          |    2 +-
 mm/page-writeback.c       |   29 ++++++++++++++++++++++++++---
 3 files changed, 29 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:10.000000000 +0800
@@ -443,13 +443,26 @@ void global_dirty_limits(unsigned long *
  *
  * The bdi's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
- */
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+ *
+ * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
+ * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
+ * B always get heavily throttled and bdi B's dirty limit might never be able
+ * to grow up from 0. So we do tricks to reserve some global margin and honour
+ * it to the bdi's that run low.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			      unsigned long dirty,
+			      unsigned long dirty_pages)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;
 
 	/*
+	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
+	 */
+	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
+
+	/*
 	 * Calculate this BDI's share of the dirty ratio.
 	 */
 	bdi_writeout_fraction(bdi, &numerator, &denominator);
@@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
 	do_div(bdi_dirty, denominator);
 
 	bdi_dirty += (dirty * bdi->min_ratio) / 100;
+
+	/*
+	 * If we can dirty N more pages globally, honour N/2 to the bdi that
+	 * runs low, so as to help it ramp up.
+	 */
+	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
+		     dirty > dirty_pages))
+		bdi_dirty = (dirty - dirty_pages) / 2;
+
 	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 		bdi_dirty = dirty * bdi->max_ratio / 100;
 
@@ -508,7 +530,8 @@ static void balance_dirty_pages(struct a
 				(background_thresh + dirty_thresh) / 2)
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
+					     nr_reclaimable + nr_writeback);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:10.000000000 +0800
@@ -83,7 +83,7 @@ static int bdi_debug_stats_show(struct s
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:45:58.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
@@ -126,7 +126,8 @@ int dirty_writeback_centisecs_handler(st
 
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty);
+			       unsigned long dirty,
+			       unsigned long dirty_pages);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 02/35] writeback: safety margin for bdi stat error
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bdi-error.patch --]
[-- Type: text/plain, Size: 2441 bytes --]

In a simple dd test on an 8p system with "mem=256M", I find that all
light dirtier tasks on the root fs get heavily throttled. That happens
because the global limit is exceeded. It's hard to believe at first
sight, because the test fs doing the heavy dd is under its bdi limit.
After doing some tracing, it's discovered that

        bdi_dirty < bdi_dirty_limit() < global_dirty_limit() < nr_dirty

So the root cause is that bdi_dirty is well under the global nr_dirty
due to accounting errors. This could be fixed by using bdi_stat_sum(),
however that's costly on large NUMA machines. So apply the less costly
fix of lowering the bdi limit, so that the accounting errors won't lead
to the absurd situation "global limit exceeded but bdi limit not
exceeded".

This provides a guarantee when there is only one heavily dirtied bdi,
and works opportunistically for two or more heavily dirtied bdi's
(hopefully they won't accumulate a big error _and_ exceed their bdi
limit at the same time).
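
A rough worked example with numbers of my own (not taken from the test
box): suppose

	dirty_thresh         = 12000 pages
	bdi_stat_error(bdi)  =  2000 pages

The per-bdi per-cpu counters may under-report by up to that error, so
bdi_dirty can read 10500 while the true value (and the global nr_dirty)
is 12500 > dirty_thresh: the global limit is exceeded although the bdi
still appears under its limit. With this patch, bdi_dirty_limit() works
against 12000 - 2000 = 10000 pages, so the bdi limit trips first even in
the worst case.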

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
@@ -434,10 +434,16 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ * @dirty_pages: current number of dirty pages
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ *
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *
@@ -458,6 +464,14 @@ unsigned long bdi_dirty_limit(struct bac
 	long numerator, denominator;
 
 	/*
+	 * try to prevent "global limit exceeded but bdi limit not exceeded"
+	 */
+	if (likely(dirty > bdi_stat_error(bdi)))
+		dirty -= bdi_stat_error(bdi);
+	else
+		return 0;
+
+	/*
 	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
 	 */
 	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 03/35] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-fix-duplicate-bdp-calls.patch --]
[-- Type: text/plain, Size: 1125 bytes --]

When dd'ing in 512-byte chunks, balance_dirty_pages_ratelimited() used to
be called 8 times for the same page, even though the page is only dirtied
once. Fix it with a (slightly racy) PageDirty() test.
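
Spelling the 512-byte case out (my own arithmetic, assuming 4KB pages):
each page takes 8 write_end() calls, but only the first one samples
PageDirty() as clear; the remaining 7 see the page already dirty and now
skip balance_dirty_pages_ratelimited(), so the throttle path is entered
once per newly dirtied page rather than once per 512-byte chunk.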

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/filemap.c	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/mm/filemap.c	2010-12-13 21:46:11.000000000 +0800
@@ -2244,6 +2244,7 @@ static ssize_t generic_perform_write(str
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
+	unsigned int dirty;
 
 	/*
 	 * Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2292,6 +2293,7 @@ again:
 		pagefault_enable();
 		flush_dcache_page(page);
 
+		dirty = PageDirty(page);
 		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
@@ -2318,7 +2320,8 @@ again:
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		if (!dirty)
+			balance_dirty_pages_ratelimited(mapping);
 
 	} while (iov_iter_count(i));
 



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Richard Kennedy, Wu Fengguang,
	Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1521 bytes --]

Reduce the dampening for the control system, yielding faster
convergence.

Currently it converges at a snail's pace for slow devices (on the order
of minutes).  For really fast storage, the convergence speed should be
fine.

It makes sense to make it reasonably fast for typical desktops.

After the patch, it converges in ~10 seconds for 60MB/s writes and 4GB
mem. So expect ~1s for fast 600MB/s storage with 4GB mem, or ~4s with
16GB mem, which seems reasonable.

$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh:            0 kB
BdiDirtyThresh:       118748 kB
BdiDirtyThresh:       214280 kB
BdiDirtyThresh:       303868 kB
BdiDirtyThresh:       376528 kB
BdiDirtyThresh:       411180 kB
BdiDirtyThresh:       448636 kB
BdiDirtyThresh:       472260 kB
BdiDirtyThresh:       490924 kB
BdiDirtyThresh:       499596 kB
BdiDirtyThresh:       507068 kB
...
DirtyThresh:          530392 kB
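
Rough arithmetic behind the speedup (my own back-of-envelope numbers,
assuming 4KB pages and dirty_total ~= 200000 pages, i.e. 20% of the 4GB
box): ilog2(dirty_total - 1) = 17, so the returned shift drops from
2 + 17 = 19 to 17 - 1 = 16, i.e. by 3.  If I read the floating
proportions code right, the estimation period is about 1 << shift
completed pages, so it shrinks by 2^3 = 8, from ~2GB to ~256MB of
writeout; at 60MB/s one period is ~4s, which lines up with the ~10s
convergence shown above.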

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
@@ -145,7 +145,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1) - 1;
 }
 
 /*



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 05/35] writeback: IO-less balance_dirty_pages()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 29277 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it sleep for some time
to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher
thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALE
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, ending up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency; hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user-perceivable stalls of more than 1
  second.  Even the current 4MB write size may be too large for slow USB
  sticks.  The fact that balance_dirty_pages() starts IO on its own
  couples the IO size to the wait time, which makes it hard to pick a
  suitable IO size while keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay COMMITs and reduce their number. So it's not likely
  that such bursty IO completions can be optimized away, nor the
  resulting large (and tiny) stall times in IO-completion-based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling the
number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause times (less than 10ms, which burns CPU power)
- avoid too large pause times (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, when doing a simple cp on ext4 with mem=4G HZ=250.

before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

after patch, the pause time remains stable around 32ms

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8

CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of recently dirtied pages that belong to the
task. If the task dirtied 100% of the recently dirtied pages, it will
lower bdi_dirty_limit by 1/8. If it dirtied only 1% of them, it will
return an almost unmodified bdi_dirty_limit. In this way, a heavy
dirtier will get blocked at
task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while a light
dirtier is allowed to progress (the latter won't be blocked because
R << B in fig.1).

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T
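
To put rough numbers on fig.1 (illustrative values of my own): with
T = 80000 pages, L = T - T/8 = 70000.  A heavy dirtier with ~100% weight
gets A ~= 70000 while a light dirtier with ~1% weight gets B ~= 79900.
Once R = bdi_reclaimable + bdi_writeback approaches ~70000, the heavy
dirtier blocks at A while the light one still has ~10000 pages of
headroom before reaching B.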

Since each process has its own dirty limit, we reuse A/B for the tasks as
well as their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The task_dirty_limit for A and B will approach
the center of the region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:

	throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
	A  = T - Wa * T/16
	La = A - A/16
where Wa is the task weight for A. It's 0 for a very light dirtier and 1
for the single heavy dirtier (that consumes 100% of the bdi write
bandwidth).  The task weight will be updated independently by
task_dirty_inc() at set_page_dirty() time.

When R < La, we don't throttle at all.
When R > A, the code will detect the negative value and choose to pause
for 200ms (the upper pause boundary), then loop over again.

The 200ms max pause time helps reduce overheads in server workloads
with lots of concurrent dirtier tasks.
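
Plugging in illustrative numbers of my own: with T = 100000 pages and a
single heavy dirtier (Wa = 1), A = T - T/16 = 93750 and
La = A - A/16 ~= 87891.  If R sits halfway through the soft region at
~90820, then throttle_bandwidth ~= bdi_bandwidth * 2930/5859 ~=
bdi_bandwidth/2.  With bdi_bandwidth = 60MB/s and N = 128 pages (512KB
with 4KB pages) dirtied per call, the pause works out to roughly
512KB / 30MB/s ~= 17ms, well within the 10ms..200ms band aimed for above.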

PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 200ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 200ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  the bdi dirty/writeback pages drop by a reasonably large amount

- when dirty_thresh is exceeded, the task can be throttled for an
  arbitrarily long time


BEHAVIOR CHANGE
===============

Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold. A single "cp" could be
soft throttled at 8*bdi->write_bandwidth around 15% dirty pages, and be
balanced at speed bdi->write_bandwidth around 17.5% dirty pages. Before
the patch, the behavior was to just throttle it at 17.5% dirty pages.

Since the task will be soft throttled earlier than before, end users
may perceive it as a performance "slow down" if their application
happens to dirty more than ~15% of memory.


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, it's in fact slightly slower.

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overhead. It mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increases by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644

The CPU overhead in 2.6.37-rc1-next-20101115+ is higher than in
2.6.36-rc2-mm1+balance_dirty_pages; this may be because the pause time
stabilizes at lower values due to some algorithm adjustments (eg.
reducing the minimal pause time from 10ms to 1 jiffy in the new
version), leading to many more balance_dirty_pages() calls. The
different pause times also explain the different system times for the
1/10/100dd cases on the same 2.6.37-rc1-next-20101115+.

Andrew Morton <akpm@linux-foundation.org>:

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   84 +---
 3 files changed, 251 insertions(+), 53 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:12.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
@@ -43,18 +43,9 @@
 static long ratelimit_pages = 32;
 
 /*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
+#define MAX_PAUSE	max(HZ/5, 1)
 
 /* The following parameters are exported via /proc/sys/vm */
 
@@ -279,7 +270,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -509,26 +500,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -566,6 +556,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = MAX_PAUSE;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, MAX_PAUSE);
+
+pause:
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -581,35 +588,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -626,7 +604,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -675,7 +653,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-12-13 21:46:12.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+A write(2) is normally a buffered write that creates dirty page cache pages
+to hold the data and returns immediately. The dirty pages will eventually
+be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 200ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties that many pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to storage and task write speeds, so that the
+task always gets a suitable (neither too long nor too small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A].
+At which point (La) the task will be controlled under speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When there comes a new dd task B, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%.  When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16.  throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+drift slightly, which is otherwise not a big deal.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and cross each other.
+
+Sure there are always oscillations of bdi_dirty_pages as long as the dirtier
+task alternately dirties and pauses. But they will be bounded. When there is 1
+heavy dirtier, the error bound will be (pause_time * bdi_bandwidth). When there
+are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which remains the same as the 1 dirtier case (given the same pause time). In
+fact, the more dirtier tasks there are, the smaller the errors, since the
+dirtier tasks are not likely to all sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 05/35] writeback: IO-less balance_dirty_pages()
@ 2010-12-13 14:46   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 29573 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALS
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user-perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on its own couples the
  IO size to the wait time, which makes it hard to choose a suitable IO
  size while keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may clear a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are unlikely to be optimized away, nor are the resulting
  large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling the
number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling (see the example after the list):

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 10ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
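
As a quick illustration (made-up but self-consistent numbers): with HZ=250, a
target pause of 8 jiffies (32ms) and a throttle bandwidth of about 16MB/s, the
task should be allowed to dirty roughly 16MB/s * 32ms / 4KB ~= 128 pages
between two balance_dirty_pages() calls -- which is what the "dirtied=128
pause=8" trace lines further below show.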

For example, when doing a simple cp on ext4 with mem=4G and HZ=250:

before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

after patch, the pause time remains stable around 32ms

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8

CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of pages recently dirtied by the task. If 100%
of the pages were recently dirtied by the task, it will lower
bdi_dirty_limit by 1/8. If only 1% of the pages were dirtied by the
task, it will return an almost unmodified bdi_dirty_limit. In this way,
a heavy dirtier will get blocked at
task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while a light
dirtier is allowed to progress (the latter won't be blocked because
R << B in fig.1).
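
For illustration (made-up numbers): with bdi_dirty_limit = 120MB, a task that
recently dirtied 100% of the pages gets task_dirty_limit = 120MB - 120MB/8 =
105MB, while a task that dirtied only 1% gets 120MB - 1% * 120MB/8 ~= 119.85MB,
i.e. nearly the unmodified limit.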

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T

Since each process has its own dirty limit, we use A/B to refer to both
the tasks and their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The task_dirty_limit for A and B will be
approaching the center of region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
	A = T - Wa * T/16
        La = A - A/16
where Wa is the task weight of A. It's 0 for a very light dirtier and 1
for a single heavy dirtier (that consumes 100% of the bdi write
bandwidth).  The task weight is updated independently by
task_dirty_inc() at set_page_dirty() time.
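
For illustration (made-up numbers): with T = 16000 pages and one heavy dirtier
(Wa ~= 1), A = T - T/16 = 15000 pages and La = A - A/16 ~= 14062 pages. At
R = 14500 pages, throttle_bandwidth = bdi_bandwidth * (15000 - 14500) /
(15000 - 14062) ~= 0.53 * bdi_bandwidth. As R approaches A the factor drops
towards 0; at R = La it is exactly 1, i.e. the task may dirty at the full bdi
write bandwidth.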

When R < La, the task is not throttled at all.
When R > A, the code detects the negative value and chooses to pause
for 200ms (the upper pause boundary), then loops over again.

The 200ms max pause time helps reduce overheads in server workloads
with lots of concurrent dirtier tasks.
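
To make the arithmetic concrete, here is a standalone user-space sketch of the
soft throttling computation. It is illustrative only and not the patch itself:
the real code below works in bytes, uses the bdi dirty threshold and do_div(),
and lives inside balance_dirty_pages(); the names and units here are simplified
assumptions.

/*
 * Sketch of the soft throttling math, for illustration only.
 * All page counts are in pages; bdi_bandwidth is in pages per second.
 */
#define HZ		250		/* assumed timer frequency */
#define MAX_PAUSE	(HZ / 5)	/* 200ms upper pause boundary */

static unsigned long soft_pause(unsigned long pages_dirtied,
				unsigned long bdi_bandwidth,
				unsigned long R,	/* bdi_dirty_pages */
				unsigned long A)	/* task_dirty_limit */
{
	unsigned long La = A - A / 16;	/* task_throttle_thresh */
	unsigned long bw, pause;

	if (R <= La)
		return 0;		/* below the soft region: no throttling */
	if (R >= A)
		return MAX_PAUSE;	/* beyond the task limit: hard 200ms pause */

	/* throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La) */
	bw = bdi_bandwidth * (A - R) / (A - La);

	/* sleep so that pages_dirtied / pause matches the throttle bandwidth */
	pause = HZ * pages_dirtied / (bw + 1);
	if (pause < 1)
		pause = 1;
	if (pause > MAX_PAUSE)
		pause = MAX_PAUSE;
	return pause;
}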

PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 200ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 200ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  the bdi dirty/writeback pages drop by a reasonably large amount

- when dirty_thresh is exceeded, the task can be throttled for an
  arbitrarily long time


BEHAVIOR CHANGE
===============

Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold. For a single "cp", it
could be soft throttled at 8*bdi->write_bandwidth around 15% dirty
pages, and be balanced at speed bdi->write_bandwidth around 17.5% dirty
pages. Before the patch, the behavior is to simply throttle it at 17.5%
dirty pages.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as a performance "slow down" if their
application happens to dirty more than ~15% of memory.
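
For a rough sense of scale (illustrative; assumes the default
dirty_background_ratio=10 and dirty_ratio=20 and about 4GB of reclaimable
memory): the 15% soft threshold corresponds to roughly 600MB of dirty pages
and the 17.5% balance point to roughly 700MB.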


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, the patched kernel is in fact slightly slower:

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. The patch mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increases by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644

The CPU overheads in 2.6.37-rc1-next-20101115+ are higher than in
2.6.36-rc2-mm1+balance_dirty_pages. This may be due to the pause time
stabilizing at lower values after some algorithm adjustments (eg.
reducing the minimal pause time from 10ms to 1 jiffy in the new
version), leading to many more balance_dirty_pages() calls. The
different pause times also explain the different system times for the
1/10/100dd cases on the same 2.6.37-rc1-next-20101115+.

Andrew Morton <akpm@linux-foundation.org>:

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   84 +---
 3 files changed, 251 insertions(+), 53 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:12.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
@@ -43,18 +43,9 @@
 static long ratelimit_pages = 32;
 
 /*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
+#define MAX_PAUSE	max(HZ/5, 1)
 
 /* The following parameters are exported via /proc/sys/vm */
 
@@ -279,7 +270,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -509,26 +500,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -566,6 +556,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = MAX_PAUSE;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, MAX_PAUSE);
+
+pause:
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -581,35 +588,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -626,7 +604,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -675,7 +653,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-12-13 21:46:12.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+write(2) is normally a buffered write: it creates dirty page cache pages
+to hold the data and returns immediately. The dirty pages will eventually
+be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% of reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter-device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling of its dirtier tasks, leading to a big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 200ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Here hard throttling means waiting until bdi_dirty_pages falls by more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for a long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties that many pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to the storage and task write speeds, so that
+the task always gets a suitable (neither too long nor too small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until it
+exceeds the low threshold of the task's soft throttling region [La, A], at
+which point (La) the task will be throttled at speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When a new dd task B starts, task_weight_B will gradually grow from 0 to 50%
+while task_weight_A decreases from 100% to 50%.  While task_weight_B is still
+small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact it won't be throttled at all
+initially, as long as R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow larger. As task_weight_A and task_weight_B
+converge to 50%, the points A and B will move towards each other (fig.4) and
+eventually coincide. R will stabilize around A-A/32 where A=B=T-0.5*T/16.
+throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussion. With non-zero user space think time, the balance point will
+drift slightly, which is otherwise not a big deal.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B because, as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal; A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+will stop moving and will not cross each other.
+
+Of course there will always be oscillations of bdi_dirty_pages as long as the
+dirtier task alternately dirties pages and pauses, but they are bounded. When
+there is 1 heavy dirtier, the error bound is (pause_time * bdi_bandwidth). When
+there are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which is the same as in the 1-dirtier case (given the same pause time). In fact
+the more dirtier tasks there are, the smaller the error, since the tasks are
+unlikely to all go to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 05/35] writeback: IO-less balance_dirty_pages()
@ 2010-12-13 14:46   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 29573 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALS
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than  10ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, when doing a simple cp on ext4 with mem=4G HZ=250.

before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

after patch, the pause time remains stable around 32ms

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8

CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percent of pages recently dirtied by the task. If 100%
pages are recently dirtied by the task, it will lower bdi_dirty_limit by
1/8. If only 1% pages are dirtied by the task, it will return almost
unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
allowing a light dirtier to progress (the latter won't be blocked
because R << B in fig.1).

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T

Since each process has its own dirty limit, we reuse A/B for the tasks as
well as their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The task_dirty_limit for A and B will be
approaching the center of region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
	A = T - Wa * T/16
        La = A - A/16
where Wa is task weight for A. It's 0 for very light dirtier and 1 for
the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
task weight will be updated independently by task_dirty_inc() at
set_page_dirty() time.

When R < La, we don't throttle it at all.
When R > A, the code will detect the negativeness and choose to pause
200ms (the upper pause boundary), then loop over again.

The 200ms max pause time helps reduce overheads in server workloads
with lots of concurrent dirtier tasks.

PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 200ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 200ms

Basically there are three level of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  bdi dirty/writeback pages go down reasonably large

- when dirty_thresh is exceeded, the task can be throttled for arbitrary
  long time


BEHAVIOR CHANGE
===============

Users will notice that the applications will get throttled once the
crossing the global (background + dirty)/2=15% threshold. For a single
"cp", it could be soft throttled at 8*bdi->write_bandwidth around 15%
dirty pages, and be balanced at speed bdi->write_bandwidth around 17.5%
dirty pages. Before patch, the behavior is to just throttle it at 17.5%
dirty pages.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than ~15% memory.


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, it's in fact slightly slower.

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. It mainly
benefits file servers with heavy concurrent writers on fast storage
arrays. As can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increases by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644

The CPU overheads in 2.6.37-rc1-next-20101115+ is higher than
2.6.36-rc2-mm1+balance_dirty_pages, this may be due to the pause time
stablizing at lower values due to some algorithm adjustments (eg.
reduce the minimal pause time from 10ms to 1jiffy in new version)
leading to much more balance_dirty_pages() calls. The different pause
time also explains the different system time for 1/10/100dd cases on
the same 2.6.37-rc1-next-20101115+.

Andrew Morton <akpm@linux-foundation.org>:

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   84 +---
 3 files changed, 251 insertions(+), 53 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:12.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
@@ -43,18 +43,9 @@
 static long ratelimit_pages = 32;
 
 /*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
+#define MAX_PAUSE	max(HZ/5, 1)
 
 /* The following parameters are exported via /proc/sys/vm */
 
@@ -279,7 +270,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -509,26 +500,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -566,6 +556,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = MAX_PAUSE;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, MAX_PAUSE);
+
+pause:
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -581,35 +588,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -626,7 +604,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -675,7 +653,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-12-13 21:46:12.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+The write(2) is normally buffered write that creates dirty page cache pages
+for holding the data and return immediately. The dirty pages will eventually
+be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversly with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 200ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties so much pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to storage and task write speeds, so that the
+task always get suitable (not too long or small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A].
+At which point (La) the task will be controlled under speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When there comes a new dd task B, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%.  When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16.  throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+slightly drift and not a big deal otherwise.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and crossing each other.
+
+Sure there will always be oscillations of bdi_dirty_pages as long as the
+dirtier tasks alternately dirty pages and pause. But they will be bounded.
+When there is 1 heavy dirtier, the error bound will be
+(pause_time * bdi_bandwidth). When there are 2 heavy dirtiers, the max error
+is 2 * (pause_time * bdi_bandwidth/2), which remains the same as in the
+1-dirtier case (given the same pause time). In fact, the more dirtier tasks
+there are, the smaller the error will be, since the dirtier tasks are not
+likely to all go to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf
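
To make the position-based throttling illustrated above concrete, here is a
minimal stand-alone sketch of the computation (plain user-space C, not kernel
code; the limit, bandwidth and page-size numbers are made up for the demo, and
the real code additionally clamps the pause to [1 jiffy, MAX_PAUSE]):

	/* throttle-demo.c -- illustrative sketch, not part of the patchset */
	#include <stdio.h>

	#define TASK_SOFT_DIRTY_LIMIT	16	/* soft region is ~limit/16 wide */

	/*
	 * Scale the throttle bandwidth down linearly from ~bdi_bandwidth to 0
	 * as bdi_dirty climbs through the soft throttling region below the
	 * limit, mirroring the bw calculation in balance_dirty_pages().
	 */
	static unsigned long throttle_bw(unsigned long bdi_bw,
					 unsigned long bdi_dirty,
					 unsigned long limit)
	{
		if (bdi_dirty >= limit)
			return 0;
		return bdi_bw * (limit - bdi_dirty) /
		       (limit / TASK_SOFT_DIRTY_LIMIT + 1);
	}

	int main(void)
	{
		unsigned long bdi_bw = 25600;	/* ~100MB/s in 4k pages/s */
		unsigned long limit = 100000;	/* pages, made-up number */
		unsigned long pages_dirtied = 64;
		unsigned long d;

		for (d = limit - limit / 16; d < limit; d += limit / 64) {
			unsigned long bw = throttle_bw(bdi_bw, d, limit);
			/* pause so that pages_dirtied / pause ~= bw */
			unsigned long pause_ms = bw ? 1000 * pages_dirtied / bw
						    : 1000;

			if (pause_ms > 200)	/* MAX_PAUSE-like clamp */
				pause_ms = 200;
			printf("dirty=%lu throttle_bw=%lu pages/s pause=%lums\n",
			       d, bw, pause_ms);
		}
		return 0;
	}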


^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 06/35] writeback: consolidate variable names in balance_dirty_pages()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 3249 bytes --]

There are lots of lengthy tests in balance_dirty_pages(). Let's compact the names:

	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

balance_dirty_pages() only cares about the above dirty sum except
in one place -- on starting background writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
@@ -502,8 +502,9 @@ unsigned long bdi_dirty_limit(struct bac
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long pages_dirtied)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	long nr_reclaimable;
+	long nr_dirty;
+	long bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -521,7 +522,7 @@ static void balance_dirty_pages(struct a
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -530,12 +531,10 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
-					     nr_reclaimable + nr_writeback);
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
@@ -549,21 +548,21 @@ static void balance_dirty_pages(struct a
 		 * deltas.
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
+				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+		if (bdi_dirty >= bdi_thresh) {
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 
 		bw = 100 << 20; /* use static 100MB/s for the moment */
 
-		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		bw = bw * (bdi_thresh - bdi_dirty);
 		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
@@ -579,9 +578,8 @@ pause:
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+				  (nr_dirty > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 07/35] writeback: per-task rate limit on balance_dirty_pages()
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-per-task-dirty-count.patch --]
[-- Type: text/plain, Size: 9354 bytes --]

Try to limit the dirty throttle pause time to the range [1 jiffy, 100 ms],
by controlling how many pages can be dirtied before inserting a pause.

The dirty count will be billed directly to the task struct. Slow start and
quick back-off are employed, so that the stable range will be biased towards
pauses of less than 50ms. Another intention is fine-grained timing control
for slow devices, which may need to do a full 100ms pause for every single
page.

The switch from a per-cpu to a per-task rate limit makes it easier to exceed
the global dirty limit with a fork bomb, where each new task dirties 1 page,
sleeps 10m and then continues to dirty 1000 more pages. The caveat is that
when a task dirties its first page, it may be granted a high nr_dirtied_pause
because nr_dirty is still low at that time. In this way lots of tasks get
free tickets to dirty more pages than allowed. The solution is to disable
rate limiting entirely (ie. to ignore nr_dirtied_pause) once the bdi becomes
dirty exceeded.

Note that some filesystems will dirty a batch of pages before calling
balance_dirty_pages_ratelimited_nr(). This saves a little CPU overhead at
the cost of possibly overrunning the dirty limits a bit and/or, in the case
of very slow devices, pausing the application for much more than 100ms at a
time. This is a trade-off, and seems a reasonable optimization as long as
the batch size is kept within a dozen pages.
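
As an illustration of the slow start and quick back-off above, the following
stand-alone sketch (plain user-space C, not part of the patch; the starting
value and step counts are arbitrary) applies the same two update rules used
below -- grow by n/32+1 when the pause was a single jiffy, shrink by (n+2)/4
when it hit the maximum -- and shows that repeated back-off settles at 1 and
never reaches 0:

	/* nr_dirtied_pause adaptation demo -- illustrative only */
	#include <stdio.h>

	int main(void)
	{
		int n = 1;	/* nr_dirtied_pause, starting from the minimum */
		int i;

		/* slow start: the pause came back as 1 jiffy, allow more pages */
		for (i = 0; i < 20; i++) {
			n += n / 32 + 1;
			printf("grow   step %2d: nr_dirtied_pause = %d\n", i, n);
		}

		/* quick back-off: the pause hit MAX_PAUSE, allow fewer pages */
		for (i = 0; i < 20; i++) {
			n -= (n + 2) / 4;
			printf("shrink step %2d: nr_dirtied_pause = %d\n", i, n);
		}
		return 0;
	}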

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 ++
 mm/memory_hotplug.c   |    3 
 mm/page-writeback.c   |  126 ++++++++++++++++++----------------------
 3 files changed, 65 insertions(+), 71 deletions(-)

--- linux-next.orig/include/linux/sched.h	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/include/linux/sched.h	2010-12-13 21:46:13.000000000 +0800
@@ -1471,6 +1471,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
@@ -37,12 +37,6 @@
 #include <trace/events/writeback.h>
 
 /*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
- */
-static long ratelimit_pages = 32;
-
-/*
  * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
 #define MAX_PAUSE	max(HZ/5, 1)
@@ -493,6 +487,40 @@ unsigned long bdi_dirty_limit(struct bac
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it adaptively to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(struct backing_dev_info *bdi)
+{
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	unsigned long dirty_pages;
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	dirty_pages = global_page_state(NR_FILE_DIRTY) +
+		      global_page_state(NR_WRITEBACK) +
+		      global_page_state(NR_UNSTABLE_NFS);
+
+	if (dirty_pages <= (dirty_thresh + background_thresh) / 2)
+		goto out;
+
+	dirty_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_pages);
+	dirty_pages  = bdi_stat(bdi, BDI_RECLAIMABLE) +
+		       bdi_stat(bdi, BDI_WRITEBACK);
+
+	if (dirty_pages < dirty_thresh)
+		goto out;
+
+	return 1;
+out:
+	return 1 + int_sqrt(dirty_thresh - dirty_pages);
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -509,7 +537,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long bw;
-	unsigned long pause;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -591,6 +619,17 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	if (pause == 0 && nr_dirty < background_thresh)
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
+	else if (pause == 1)
+		current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
+	else if (pause >= MAX_PAUSE)
+		/*
+		 * when repeated, writing 1 page per 100ms on slow devices,
+		 * i-(i+2)/4 will be able to reach 1 but never reduce to 0.
+		 */
+		current->nr_dirtied_pause -= (current->nr_dirtied_pause+2) >> 2;
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -617,8 +656,6 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -628,36 +665,30 @@ static DEFINE_PER_CPU(unsigned long, bdp
  * which was newly dirtied.  The function will periodically check the system's
  * dirty state and will initiate writeback if needed.
  *
- * On really big machines, get_writeback_state is expensive, so try to avoid
+ * On really big machines, global_page_state() is expensive, so try to avoid
  * calling it too often (ratelimiting).  But once we're over the dirty memory
- * limit we decrease the ratelimiting by a lot, to prevent individual processes
- * from overshooting the limit by (ratelimit_pages) each.
+ * limit we disable the ratelimiting, to prevent individual processes from
+ * overshooting the limit by (ratelimit_pages) each.
  */
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied)
 {
-	unsigned long ratelimit;
-	unsigned long *p;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+
+	current->nr_dirtied += nr_pages_dirtied;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	if (unlikely(!current->nr_dirtied_pause))
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
 	 * tasks in balance_dirty_pages(). Period.
 	 */
-	preempt_disable();
-	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = *p;
-		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
+		     bdi->dirty_exceeded)) {
+		balance_dirty_pages(mapping, current->nr_dirtied);
+		current->nr_dirtied = 0;
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -745,44 +776,6 @@ void laptop_sync_completion(void)
 #endif
 
 /*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
-	if (ratelimit_pages < 16)
-		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
-	writeback_set_ratelimit();
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
-	.notifier_call	= ratelimit_handler,
-	.next		= NULL,
-};
-
-/*
  * Called early on to tune the page writeback dirty limits.
  *
  * We used to scale dirty pages according to how total memory
@@ -804,9 +797,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	writeback_set_ratelimit();
-	register_cpu_notifier(&ratelimit_nb);
-
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
 	prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/mm/memory_hotplug.c	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/mm/memory_hotplug.c	2010-12-13 21:46:13.000000000 +0800
@@ -446,8 +446,6 @@ int online_pages(unsigned long pfn, unsi
 
 	vm_total_pages = nr_free_pagecache_pages();
 
-	writeback_set_ratelimit();
-
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
 
@@ -877,7 +875,6 @@ repeat:
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
-	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
 	unlock_system_sleep();



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 08/35] writeback: user space think time compensation
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-task-last-dirty-time.patch --]
[-- Type: text/plain, Size: 3248 bytes --]

Take the task's think time into account when computing the final pause time.
This makes the throttle bandwidth more accurate. In the rare case that the
task slept longer than the period time, the extra sleep time will also be
compensated for in the next period if it's not too big (<100ms).  Accumulated
errors are carefully avoided as long as the task doesn't sleep for too long.

case 1: period > think

		pause = period - think
		paused_when += pause

			     period time
	      |======================================>|
		  think time
	      |===============>|
	------|----------------|----------------------|-----------
	paused_when         jiffies


case 2: period <= think

		don't pause and reduce future pause time by:
		paused_when += period

		       period time
	      |=========================>|
			     think time
	      |======================================>|
	------|--------------------------+------------|-----------
	paused_when                                jiffies
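
The two cases above amount to a small calculation. Here is a simplified
stand-alone sketch of that arithmetic (user-space C, not the patch itself;
jiffies wrap-around and the long-think-time reset branch are omitted, and the
sample numbers are made up):

	/* think time compensation demo -- simplified */
	#include <stdio.h>

	static long paused_when;	/* start of the write-and-pause period */

	static long compute_pause(long jiffies, long period)
	{
		long think = jiffies - paused_when;

		if (period > think) {
			/* case 1: sleep only for the rest of the period */
			long pause = period - think;

			paused_when = jiffies + pause;	/* end of this period */
			return pause;
		}
		/*
		 * case 2: the think time already covers the whole period;
		 * don't sleep, but advance the virtual period start so the
		 * excess is credited against the next period.
		 */
		paused_when += period;
		return 0;
	}

	int main(void)
	{
		paused_when = 1000;
		printf("pause = %ld\n", compute_pause(1020, 50));	/* 30 */
		printf("pause = %ld\n", compute_pause(1100, 50));	/* 0 */
		return 0;
	}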


Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    1 +
 mm/page-writeback.c   |   22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/sched.h	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/include/linux/sched.h	2010-12-13 21:46:13.000000000 +0800
@@ -1477,6 +1477,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long paused_when;	/* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
@@ -537,6 +537,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long bw;
+	unsigned long period;
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -583,7 +584,7 @@ static void balance_dirty_pages(struct a
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_dirty >= bdi_thresh) {
+		if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
 			pause = MAX_PAUSE;
 			goto pause;
 		}
@@ -593,12 +594,29 @@ static void balance_dirty_pages(struct a
 		bw = bw * (bdi_thresh - bdi_dirty);
 		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
-		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		period = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1) + 1;
+		pause = current->paused_when + period - jiffies;
+		/*
+		 * Take it as long think time if pause falls into (-10s, 0).
+		 * If it's less than 100ms, try to compensate it in future by
+		 * updating the virtual time; otherwise just reset the time, as
+		 * it may be a light dirtier.
+		 */
+		if (unlikely(-pause < HZ*10)) {
+			if (-pause <= HZ/10)
+				current->paused_when += period;
+			else
+				current->paused_when = jiffies;
+			pause = 1;
+			break;
+		}
 		pause = clamp_val(pause, 1, MAX_PAUSE);
 
 pause:
+		current->paused_when = jiffies;
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		current->paused_when += pause;
 
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 09/35] writeback: account per-bdi accumulated written pages
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2307 bytes --]

From: Jan Kara <jack@suse.cz>

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Peter Zijlstra <a.p.zijlstra@chello.nl>:
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().
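
For context, a cumulative counter like this is consumed by sampling it twice
and dividing the delta by the elapsed time, which is what the bandwidth
estimation in the next patch does. A rough user-space analogy (not kernel
code; the numbers are made up):

	/* deriving a write bandwidth from a monotonically increasing
	 * "pages written" counter -- illustrative only */
	#include <stdio.h>

	int main(void)
	{
		unsigned long written_stamp = 100000;	/* counter, last sample */
		unsigned long written = 125600;		/* counter, now */
		unsigned long elapsed_ms = 1000;	/* time between samples */
		unsigned long pages_per_sec;

		pages_per_sec = (written - written_stamp) * 1000 / elapsed_ms;
		printf("%lu pages/s (~%lu MB/s with 4k pages)\n",
		       pages_per_sec, pages_per_sec * 4 / 1024);
		return 0;
	}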

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-12-13 21:46:13.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:13.000000000 +0800
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -99,8 +100,9 @@ static int bdi_debug_stats_show(struct s
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
@@ -204,6 +204,7 @@ int dirty_bytes_handler(struct ctl_table
  */
 static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
 {
+	__inc_bdi_stat(bdi, BDI_WRITTEN);
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
 }



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 10/35] writeback: bdi write bandwidth estimation
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Li Shaohua, Wu Fengguang, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bandwidth-estimation-in-flusher.patch --]
[-- Type: text/plain, Size: 6830 bytes --]

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.  It's pretty accurate for common filesystems.

As the first use case, it replaces the fixed 100MB/s value used for
throttle bandwidth calculation in balance_dirty_pages().

The overhead won't be high because the bdi bandwidth update only occurs
at >100ms intervals.

Initially it's only estimated in balance_dirty_pages() because this is
the most reliable place to get a reasonably large bandwidth -- the bdi is
normally fully utilized when bdi_thresh is reached.

Then Shaohua recommended also doing it in the flusher thread, to keep the
value updated when there is only periodic/background writeback and no
task is being throttled.

The original plan was to use per-cpu vars for bdi->write_bandwidth.
However Peter suggested that this opens a window where some CPUs may see
outdated values. So it was switched to spinlock protected global vars.

It tries to update the bandwidth only when the disk is fully utilized.
Any inactive period of more than 500ms will be skipped.

The estimation is not done purely in the flusher thread because slow
devices may take dozens of seconds to write the initial 64MB chunk
(write_bandwidth starts at 100MB/s, which translates to a 64MB
nr_to_write). So it may take more than 1 minute to adapt to the small
real bandwidth if the bandwidth is only updated in the flusher thread.
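
The update itself is a weighted average over a roughly one second period.
This stand-alone sketch mirrors the arithmetic of
__bdi_update_write_bandwidth() below (user-space C with made-up sample
numbers, assuming HZ=1000 and 4k pages):

	/* write bandwidth estimation -- arithmetic sketch only */
	#include <stdio.h>

	#define HZ	1000UL
	#define PERIOD	1024UL		/* roundup_pow_of_two(HZ) */

	int main(void)
	{
		unsigned long write_bandwidth = 25600;	/* pages/s (~100MB/s) */
		unsigned long elapsed = 200;		/* jiffies since last update */
		unsigned long written_delta = 3000;	/* pages written meanwhile */
		unsigned long long bw;

		bw = (unsigned long long)written_delta * HZ;
		if (elapsed > PERIOD / 2) {
			/* long sample: turn it into a rate and cap its
			 * weight at half of the averaging period */
			bw /= elapsed;
			elapsed = PERIOD / 2;
			bw *= elapsed;
		}
		/* blend in the old estimate, weighted by the rest of the period */
		bw += (unsigned long long)write_bandwidth * (PERIOD - elapsed);
		write_bandwidth = (unsigned long)(bw / PERIOD);

		printf("new estimate: %lu pages/s\n", write_bandwidth);
		return 0;
	}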

CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |    4 ++
 include/linux/backing-dev.h |    5 ++
 include/linux/writeback.h   |   10 +++++
 mm/backing-dev.c            |    3 +
 mm/page-writeback.c         |   59 ++++++++++++++++++++++++++++++++--
 5 files changed, 78 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-12-13 21:46:14.000000000 +0800
@@ -74,6 +74,11 @@ struct backing_dev_info {
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
+	spinlock_t bw_lock;
+	unsigned long bw_time_stamp;
+	unsigned long written_stamp;
+	unsigned long write_bandwidth;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:14.000000000 +0800
@@ -660,6 +660,9 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
+	spin_lock_init(&bdi->bw_lock);
+	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT);  /* 100 MB/s */
+
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
--- linux-next.orig/fs/fs-writeback.c	2010-12-13 21:46:10.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-13 21:46:14.000000000 +0800
@@ -668,6 +668,8 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
+
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -703,6 +705,8 @@ static long wb_writeback(struct bdi_writ
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
+		bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
+
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
 
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:14.000000000 +0800
@@ -521,6 +521,56 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+					 unsigned long elapsed,
+					 unsigned long written)
+{
+	const unsigned long period = roundup_pow_of_two(HZ);
+	u64 bw;
+
+	bw = written - bdi->written_stamp;
+	bw *= HZ;
+	if (elapsed > period / 2) {
+		do_div(bw, elapsed);
+		elapsed = period / 2;
+		bw *= elapsed;
+	}
+	bw += (u64)bdi->write_bandwidth * (period - elapsed);
+	bdi->write_bandwidth = bw >> ilog2(period);
+}
+
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+			  unsigned long start_time,
+			  unsigned long bdi_dirty,
+			  unsigned long bdi_thresh)
+{
+	unsigned long elapsed;
+	unsigned long written;
+
+	if (!spin_trylock(&bdi->bw_lock))
+		return;
+
+	elapsed = jiffies - bdi->bw_time_stamp;
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+
+	/* skip quiet periods when disk bandwidth is under-utilized */
+	if (elapsed > HZ/2 &&
+	    elapsed > jiffies - start_time)
+		goto snapshot;
+
+	/* rate-limit, only update once every 100ms */
+	if (elapsed <= HZ/10)
+		goto unlock;
+
+	__bdi_update_write_bandwidth(bdi, elapsed, written);
+
+snapshot:
+	bdi->written_stamp = written;
+	bdi->bw_time_stamp = jiffies;
+unlock:
+	spin_unlock(&bdi->bw_lock);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -537,11 +587,12 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long bw;
+	unsigned long long bw;
 	unsigned long period;
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long start_time = jiffies;
 
 	for (;;) {
 		/*
@@ -585,17 +636,19 @@ static void balance_dirty_pages(struct a
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
+
 		if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 
-		bw = 100 << 20; /* use static 100MB/s for the moment */
+		bw = bdi->write_bandwidth;
 
 		bw = bw * (bdi_thresh - bdi_dirty);
 		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
-		period = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1) + 1;
+		period = HZ * pages_dirtied / ((unsigned long)bw + 1) + 1;
 		pause = current->paused_when + period - jiffies;
 		/*
 		 * Take it as long think time if pause falls into (-10s, 0).
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:12.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:14.000000000 +0800
@@ -139,6 +139,16 @@ unsigned long bdi_dirty_limit(struct bac
 			       unsigned long dirty,
 			       unsigned long dirty_pages);
 
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+			  unsigned long start_time,
+			  unsigned long bdi_dirty,
+			  unsigned long bdi_thresh);
+static inline void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+					      unsigned long start_time)
+{
+	bdi_update_bandwidth(bdi, start_time, 0, 0);
+}
+
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied);



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 11/35] writeback: show bdi write bandwidth in debugfs
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Theodore Tso, Peter Zijlstra, Wu Fengguang,
	Christoph Hellwig, Trond Myklebust, Dave Chinner, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2031 bytes --]

Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.

Also increase the numeric field width to 10, to keep the possibly huge
BdiWritten number aligned, at least on desktop systems.

This will break user space tools if they are dumb enough to depend on
the number of white spaces.
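
With the patch applied, /debug/bdi/*/stats will look roughly like this
(the numbers are made up; only the layout and the new BdiWriteBandwidth
line matter):

BdiWriteback:             2504 kB
BdiReclaimable:         156768 kB
BdiDirtyThresh:         409600 kB
DirtyThresh:            409600 kB
BackgroundThresh:       204800 kB
BdiWritten:           10483712 kB
BdiWriteBandwidth:       57344 kBps
b_dirty:                     3
b_io:                        0
b_more_io:                   0
bdi_list:                    1
state:                       8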

CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:14.000000000 +0800
@@ -87,21 +87,23 @@ static int bdi_debug_stats_show(struct s
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
-		   "BdiWriteback:     %8lu kB\n"
-		   "BdiReclaimable:   %8lu kB\n"
-		   "BdiDirtyThresh:   %8lu kB\n"
-		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n"
-		   "BdiWritten:       %8lu kB\n"
-		   "b_dirty:          %8lu\n"
-		   "b_io:             %8lu\n"
-		   "b_more_io:        %8lu\n"
-		   "bdi_list:         %8u\n"
-		   "state:            %8lx\n",
+		   "BdiWriteback:       %10lu kB\n"
+		   "BdiReclaimable:     %10lu kB\n"
+		   "BdiDirtyThresh:     %10lu kB\n"
+		   "DirtyThresh:        %10lu kB\n"
+		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiWritten:         %10lu kB\n"
+		   "BdiWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:            %10lu\n"
+		   "b_io:               %10lu\n"
+		   "b_more_io:          %10lu\n"
+		   "bdi_list:           %10u\n"
+		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
 		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-adaptive-throttle-bandwidth.patch --]
[-- Type: text/plain, Size: 3835 bytes --]

This will noticeably reduce the fluctuations of pause time when there
are 100+ concurrent dirtiers.

The more parallel dirtiers there are (1 dirtier => 4 dirtiers), the
smaller the bandwidth each dirtier gets (bdi_bandwidth =>
bdi_bandwidth/4), the smaller the gap to the dirty limit ((C-A) =>
(C-B)), and the less stable the pause time will be (given the same
fluctuation of bdi_dirty).

For example, if A drifts to A', its pause time may drift from 5ms to
6ms, while a drift from B to B' may take the pause time from 50ms to
90ms: a much larger fluctuation in relative ratio as well as in
absolute time.

Fig.1 before patch: the gap (C-B) is too small for smooth pause times

throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
                                              |       o
                                              |         o
throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
                                              |           | o <= B'
----------------------------------------------+-----------+---o
                                              A           B   C

The solution is to lower the slope of the throttle line accordingly,
which makes B stabilize at a point farther away from C.

Fig.2 after patch

throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
    lowered max throttle bandwidth for B ===> *       o
                                              |   *     o
throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
                                              |       |   * o
----------------------------------------------+-------+-------o
                                              A       B       C

Note that C is actually a different point for the 1-dirtier and
4-dirtier cases, but for easy graphing we draw them together.
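
To see what the extra scaling does to the pause times, here is a small
userspace sketch of the new computation (not part of the patch).
BDI_SOFT_DIRTY_LIMIT and TASK_SOFT_DIRTY_LIMIT are defined elsewhere in
the series; the values below are only placeholders, as are all the other
numbers.

#include <stdio.h>

#define HZ			1000UL
#define BDI_SOFT_DIRTY_LIMIT	8	/* placeholder value */
#define TASK_SOFT_DIRTY_LIMIT	16	/* placeholder value */

int main(void)
{
	unsigned long write_bandwidth = 25600;	/* pages/s, ~100MB/s */
	unsigned long bdi_thresh = 100000;	/* pages, ~400MB */
	unsigned long task_thresh = 95000;	/* pages */
	unsigned long bdi_dirty;

	for (bdi_dirty = 88000; bdi_dirty <= 94000; bdi_dirty += 2000) {
		unsigned long long bw = write_bandwidth;

		/* scale down with the distance to bdi_thresh ... */
		bw = bw * (bdi_thresh - bdi_dirty);
		bw /= bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1;

		/* ... and again with the distance to the per-task limit */
		bw = bw * (task_thresh - bdi_dirty);
		bw /= bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1;

		/* pause for a task that just dirtied 256 pages */
		printf("bdi_dirty=%lu  throttle bw=%llu pages/s  pause=%llu jiffies\n",
		       bdi_dirty, bw, HZ * 256 / (bw + 1) + 1);
	}
	return 0;
}

With these placeholder numbers the computed bandwidth falls off much
faster as bdi_dirty closes in on task_thresh, which is the "lowered
slope" effect described above.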

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
@@ -587,6 +587,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long task_thresh;
 	unsigned long long bw;
 	unsigned long period;
 	unsigned long pause = 0;
@@ -616,7 +617,7 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
-		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+		task_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -638,14 +639,23 @@ static void balance_dirty_pages(struct a
 
 		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
 
-		if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
+		if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 
+		/*
+		 * When bdi_dirty grows closer to bdi_thresh, it indicates more
+		 * concurrent dirtiers. Proportionally lower the max throttle
+		 * bandwidth. This will resist bdi_dirty from approaching to
+		 * close to task_thresh, and help reduce fluctuations of pause
+		 * time when there are lots of dirtiers.
+		 */
 		bw = bdi->write_bandwidth;
-
 		bw = bw * (bdi_thresh - bdi_dirty);
+		do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
+
+		bw = bw * (task_thresh - bdi_dirty);
 		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
 		period = HZ * pages_dirtied / ((unsigned long)bw + 1) + 1;



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 13/35] writeback: bdi base throttle bandwidth
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:46   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-for-concurrent-dirtiers.patch --]
[-- Type: text/plain, Size: 6417 bytes --]

This basically does

-	task_bw = linear_function(task_weight, bdi_dirty, bdi->write_bandwidth)
+	task_bw = linear_function(task_weight, bdi_dirty, bdi->throttle_bandwidth)

where
                                    adapt to
	bdi->throttle_bandwidth ================> bdi->write_bandwidth / N
	                        stabilize around

	N = number of concurrent heavy dirtier tasks
	    (light dirtiers will have little effect)

It offers two great benefits:

1) In many configurations (e.g. NFS), bdi->write_bandwidth fluctuates a lot
   (more than 100%) by nature, while bdi->throttle_bandwidth will be much
   more stable; it will normally be a flat line in the time-bw graph.

2) bdi->throttle_bandwidth will be close to the final task_bw in the stable
   state, whereas bdi->write_bandwidth is N times larger than task_bw.
   Given N=4, bdi_dirty will float around A before the patch, and we want
   it to stabilize around B by lowering the slope of the control line, so
   that when bdi_dirty fluctuates by the same delta (to points A'/B'), the
   corresponding fluctuation of task_bw is reduced to 1/4. The benefit is
   obvious: when there are 1000 concurrent dirtiers, the fluctuations
   quickly go out of control; with this patch, the max fluctuations are
   virtually the same as in the single-dirtier case. In this way, the
   control system can scale to arbitrarily large numbers of dirtiers.

fig.1 before patch

               bdi->write_bandwidth   ........o
                                               o
                                                o
                                                 o
                                                  o
                                                   o
                                                    o
                                                     o
                                                      o
                                                       o
                                                        o
                                                         o
   task_bw = bdi->write_bandwidth / 4 ....................o
                                                          |o
                                                          | o
                                                          |  o <= A'
----------------------------------------------------------+---o
                                                          A   C

fig.2 after patch

task_bw = bdi->throttle_bandwidth     ........o
        = bdi->write_bandwidth / 4            |   o <= B'
                                              |       o
                                              |           o
----------------------------------------------+---------------o
                                              B               C

The added complexity is that it will take some time for
bdi->throttle_bandwidth to adapt to the workload:

- 2 seconds to adapt to 10 times more dirtier tasks
- 10 seconds to adapt to 10 times fewer dirtier tasks

The slower adaptation to a reduced number of tasks is not a big problem,
because the control line is not linear: at worst, bdi_dirty will drop
below the 15% throttle threshold, where the tasks won't be throttled at
all.

When the system has dirtiers of different speeds, bdi->throttle_bandwidth
will adapt to roughly the fastest speed.
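
As a back-of-the-envelope check of the 2 second figure (a sketch, not
part of the patch): when bdi_dirty presses right up against the
threshold, the shrink step degenerates to bw -= bw >> 3, applied at most
once every 100ms by the bandwidth update rate limit, so a 10x reduction
takes a bit under 2 seconds:

#include <stdio.h>

int main(void)
{
	unsigned long bw = 25600;	/* pages/s, ~100MB/s with 4KB pages */
	int steps = 0;

	while (bw > 2560) {		/* until below 1/10 of the start */
		bw -= bw >> 3;		/* the fastest shrink step */
		steps++;
	}
	printf("10x reduction after %d updates (~%d ms)\n", steps, steps * 100);
	return 0;
}

The ramp-up path uses the smaller step bw += (bw >> 4) + 1 and only
fires once bdi_dirty has fallen well below the threshold, which is
presumably where the larger 10 second figure comes from.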

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   42 +++++++++++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-12-13 21:46:15.000000000 +0800
@@ -78,6 +78,7 @@ struct backing_dev_info {
 	unsigned long bw_time_stamp;
 	unsigned long written_stamp;
 	unsigned long write_bandwidth;
+	unsigned long throttle_bandwidth;
 
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
@@ -521,6 +521,45 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+/*
+ * The bdi throttle bandwidth is introduced for resisting bdi_dirty from
+ * getting too close to task_thresh. It allows scaling up to 1000+ concurrent
+ * dirtier tasks while keeping the fluctuation level flat.
+ */
+static void __bdi_update_throttle_bandwidth(struct backing_dev_info *bdi,
+					    unsigned long dirty,
+					    unsigned long thresh)
+{
+	unsigned long gap = thresh / TASK_SOFT_DIRTY_LIMIT + 1;
+	unsigned long bw = bdi->throttle_bandwidth;
+
+	if (dirty > thresh)
+		return;
+
+	/* adapt to concurrent dirtiers */
+	if (dirty > thresh - gap) {
+		bw -= bw >> (3 + 4 * (thresh - dirty) / gap);
+		goto out;
+	}
+
+	/* adapt to one single dirtier */
+	if (dirty > thresh - gap * 2 + gap / 4 &&
+	    bw > bdi->write_bandwidth + bdi->write_bandwidth / 2) {
+		bw -= bw >> (3 + 4 * (thresh - dirty - gap) / gap);
+		goto out;
+	}
+
+	if (dirty <= thresh - gap * 2 - gap / 2 &&
+	    bw < bdi->write_bandwidth - bdi->write_bandwidth / 2) {
+		bw += (bw >> 4) + 1;
+		goto out;
+	}
+
+	return;
+out:
+	bdi->throttle_bandwidth = bw;
+}
+
 static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 					 unsigned long elapsed,
 					 unsigned long written)
@@ -563,6 +602,7 @@ void bdi_update_bandwidth(struct backing
 		goto unlock;
 
 	__bdi_update_write_bandwidth(bdi, elapsed, written);
+	__bdi_update_throttle_bandwidth(bdi, bdi_dirty, bdi_thresh);
 
 snapshot:
 	bdi->written_stamp = written;
@@ -651,7 +691,7 @@ static void balance_dirty_pages(struct a
 		 * close to task_thresh, and help reduce fluctuations of pause
 		 * time when there are lots of dirtiers.
 		 */
-		bw = bdi->write_bandwidth;
+		bw = bdi->throttle_bandwidth;
 		bw = bw * (bdi_thresh - bdi_dirty);
 		do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
 
--- linux-next.orig/mm/backing-dev.c	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-12-13 21:46:15.000000000 +0800
@@ -664,6 +664,7 @@ int bdi_init(struct backing_dev_info *bd
 
 	spin_lock_init(&bdi->bw_lock);
 	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT);  /* 100 MB/s */
+	bdi->throttle_bandwidth = 100 << (20 - PAGE_SHIFT);
 
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 14/35] writeback: smoothed bdi dirty pages
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-smoothed-bdi_dirty.patch --]
[-- Type: text/plain, Size: 4008 bytes --]

This basically does

-	task_bw = linear_function(task_weight, bdi_dirty, bdi->throttle_bandwidth)
+	task_bw = linear_function(task_weight, avg_dirty, bdi->throttle_bandwidth)

So that the fluctuations of bdi_dirty can be filtered by half.

The main problem is that in NFS, bdi_dirty regularly drops suddenly by
dozens of megabytes on the completion of COMMIT requests.  The same
problem, though less severe, exists for btrfs, xfs and maybe some types
of storage. avg_dirty can help filter out such downward spikes.

Upward spikes are also possible, and if they do happen, they should
better be fixed in the FS code.  To avoid exceeding the dirty limits,
once bdi_dirty exceeds avg_dirty, the higher value will instantly be
used as the feedback to the control system. So, for the sake of safety,
the control system cannot filter out upward spikes.
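
For illustration, here is a userspace sketch of the smoothing (not part
of the patch; the trace and all numbers are made up, and the thresh
checks are omitted). It mimics an NFS COMMIT completion that suddenly
drops bdi_dirty by ~64MB and shows that the value fed back to the
control system barely moves:

#include <stdio.h>

int main(void)
{
	/* bdi_dirty samples, one per update; the drop at the 4th sample
	 * mimics the completion of a large NFS COMMIT (~64MB = 16k pages) */
	unsigned long trace[] = { 50000, 51000, 52000, 36000, 37000,
				  38000, 39000, 40000, 41000, 42000 };
	unsigned long avg = 50000, old = 50000;
	unsigned int i;

	for (i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		unsigned long dirty = trace[i];

		/* skip samples that move monotonically between old and avg */
		if (!(dirty <= avg && dirty >= old) &&
		    !(dirty >= avg && dirty <= old))
			avg = (avg * 15 + dirty) / 16;
		old = dirty;

		/* balance_dirty_pages() never trusts avg below the raw value */
		printf("bdi_dirty=%5lu  avg_dirty=%5lu  feedback=%5lu\n",
		       dirty, avg, avg < dirty ? dirty : avg);
	}
	return 0;
}

Upward moves (the first three samples) pass through unfiltered, while
the sudden 16000 page drop only pulls avg_dirty down by about 1/16 of
the delta per update.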

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    2 +
 mm/page-writeback.c         |   44 ++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 4 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-12-13 21:46:15.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-12-13 21:46:15.000000000 +0800
@@ -79,6 +79,8 @@ struct backing_dev_info {
 	unsigned long written_stamp;
 	unsigned long write_bandwidth;
 	unsigned long throttle_bandwidth;
+	unsigned long avg_dirty;
+	unsigned long old_dirty;
 
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
@@ -521,6 +521,36 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+static void __bdi_update_dirty_smooth(struct backing_dev_info *bdi,
+				      unsigned long dirty,
+				      unsigned long thresh)
+{
+	unsigned long avg = bdi->avg_dirty;
+	unsigned long old = bdi->old_dirty;
+
+	/* skip call from the flusher */
+	if (!thresh)
+		return;
+
+	if (avg > thresh) {
+		avg = dirty;
+		goto update;
+	}
+
+	if (dirty <= avg && dirty >= old)
+		goto out;
+
+	if (dirty >= avg && dirty <= old)
+		goto out;
+
+	avg = (avg * 15 + dirty) / 16;
+
+update:
+	bdi->avg_dirty = avg;
+out:
+	bdi->old_dirty = dirty;
+}
+
 /*
  * The bdi throttle bandwidth is introduced for resisting bdi_dirty from
  * getting too close to task_thresh. It allows scaling up to 1000+ concurrent
@@ -601,8 +631,9 @@ void bdi_update_bandwidth(struct backing
 	if (elapsed <= HZ/10)
 		goto unlock;
 
+	__bdi_update_dirty_smooth(bdi, bdi_dirty, bdi_thresh);
 	__bdi_update_write_bandwidth(bdi, elapsed, written);
-	__bdi_update_throttle_bandwidth(bdi, bdi_dirty, bdi_thresh);
+	__bdi_update_throttle_bandwidth(bdi, bdi->avg_dirty, bdi_thresh);
 
 snapshot:
 	bdi->written_stamp = written;
@@ -624,6 +655,7 @@ static void balance_dirty_pages(struct a
 	long nr_reclaimable;
 	long nr_dirty;
 	long bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
+	long avg_dirty;  /* smoothed bdi_dirty */
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -679,7 +711,11 @@ static void balance_dirty_pages(struct a
 
 		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
 
-		if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
+		avg_dirty = bdi->avg_dirty;
+		if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
+			avg_dirty = bdi_dirty;
+
+		if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
 			pause = MAX_PAUSE;
 			goto pause;
 		}
@@ -692,10 +728,10 @@ static void balance_dirty_pages(struct a
 		 * time when there are lots of dirtiers.
 		 */
 		bw = bdi->throttle_bandwidth;
-		bw = bw * (bdi_thresh - bdi_dirty);
+		bw = bw * (bdi_thresh - avg_dirty);
 		do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
 
-		bw = bw * (task_thresh - bdi_dirty);
+		bw = bw * (task_thresh - avg_dirty);
 		do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
 
 		period = HZ * pages_dirtied / ((unsigned long)bw + 1) + 1;



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 15/35] writeback: adapt max balance pause time to memory size
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-max-pause-time-for-small-memory-system.patch --]
[-- Type: text/plain, Size: 2619 bytes --]

For small memory systems, sleeping for 200ms at a time is overkill.
Given a 4MB dirty limit, all the dirty/writeback pages can be written
to an 80MB/s disk within 50ms.  If the task goes to sleep for 200ms
after dirtying 4MB, the disk will sit idle for 150ms with no new data
to feed it.

So allow up to N milliseconds of pause time for a (4*N) MB bdi dirty
limit.  On a typical 4GB desktop, the max pause time will be ~150ms.
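
As a worked example of the arithmetic in max_pause() below (assuming
4KB pages and HZ=1000, so one jiffy is roughly 1ms): the shift works
out to 32 - 12 - 10 = 10, i.e. bdi_thresh/1024 pages, or 1ms per 4MB.
A ~600MB bdi dirty limit (about 150000 pages) then gives 146 + 2 = 148
jiffies, i.e. ~150ms, while a 4MB limit gives only 1 + 2 = 3ms.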

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:15.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
@@ -643,6 +643,22 @@ unlock:
 }
 
 /*
+ * Limit pause time for small memory systems. If sleeping for too long time,
+ * the small pool of dirty/writeback pages may go empty and disk go idle.
+ */
+static unsigned long max_pause(unsigned long bdi_thresh)
+{
+	unsigned long t;
+
+	/* 1ms for every 4MB */
+	t = bdi_thresh >> (32 - PAGE_CACHE_SHIFT -
+			   ilog2(roundup_pow_of_two(HZ)));
+	t += 2;
+
+	return min_t(unsigned long, t, MAX_PAUSE);
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -663,6 +679,7 @@ static void balance_dirty_pages(struct a
 	unsigned long long bw;
 	unsigned long period;
 	unsigned long pause = 0;
+	unsigned long pause_max;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
@@ -715,8 +732,10 @@ static void balance_dirty_pages(struct a
 		if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
 			avg_dirty = bdi_dirty;
 
+		pause_max = max_pause(bdi_thresh);
+
 		if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
-			pause = MAX_PAUSE;
+			pause = pause_max;
 			goto pause;
 		}
 
@@ -750,7 +769,7 @@ static void balance_dirty_pages(struct a
 			pause = 1;
 			break;
 		}
-		pause = clamp_val(pause, 1, MAX_PAUSE);
+		pause = clamp_val(pause, 1, pause_max);
 
 pause:
 		current->paused_when = jiffies;
@@ -781,7 +800,7 @@ pause:
 		current->nr_dirtied_pause = ratelimit_pages(bdi);
 	else if (pause == 1)
 		current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
-	else if (pause >= MAX_PAUSE)
+	else if (pause >= pause_max)
 		/*
 		 * when repeated, writing 1 page per 100ms on slow devices,
 		 * i-(i+2)/4 will be able to reach 1 but never reduce to 0.



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-min-pause-time-for-concurrent-dirtiers.patch --]
[-- Type: text/plain, Size: 2104 bytes --]

Target a pause time of >60ms when there are 100+ heavy dirtiers per
bdi (it will average around 100ms given the 200ms max pause time).

It's OK for 1 dd task doing 100MB/s to be throttle-paused 100 times
per second.  However, when there are 100 tasks writing to the same
disk, that sums up to 100*100 balance_dirty_pages() calls per second
and may lead to massive cacheline bouncing when accessing the global
page states on NUMA machines.  Even on single-socket boxes, we easily
see a >10% reduction in CPU time by increasing the pause time.
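
As a rough worked example (assuming HZ=1000, 4KB pages, and that
throttle_bandwidth has converged to roughly write_bandwidth divided by
the number of dirtiers): with write_bandwidth at 2^15 pages/s
(~128MB/s) and throttle_bandwidth at 2^8 pages/s (~1MB/s per task,
i.e. about 128 dirtiers), hi - lo = 7 and min_pause() returns
7 * 10 * HZ / 1024 ~= 68 jiffies, about 70ms, always clamped to at
most half of the max pause.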

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
@@ -659,6 +659,27 @@ static unsigned long max_pause(unsigned 
 }
 
 /*
+ * Scale up pause time for concurrent dirtiers in order to reduce CPU overheads.
+ * But ensure reasonably large [min_pause, max_pause] range size, so that
+ * nr_dirtied_pause (and hence future pause time) can stay reasonably stable.
+ */
+static unsigned long min_pause(struct backing_dev_info *bdi,
+			       unsigned long max)
+{
+	unsigned long hi = ilog2(bdi->write_bandwidth);
+	unsigned long lo = ilog2(bdi->throttle_bandwidth);
+	unsigned long t;
+
+	if (lo >= hi)
+		return 1;
+
+	/* (N * 10ms) on 2^N concurrent tasks */
+	t = (hi - lo) * (10 * HZ) / 1024;
+
+	return clamp_val(t, 1, max / 2);
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -798,7 +819,7 @@ pause:
 
 	if (pause == 0 && nr_dirty < background_thresh)
 		current->nr_dirtied_pause = ratelimit_pages(bdi);
-	else if (pause == 1)
+	else if (pause <= min_pause(bdi, pause_max))
 		current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
 	else if (pause >= pause_max)
 		/*



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 17/35] writeback: quit throttling when bdi dirty pages dropped low
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bdi-throttle-break.patch --]
[-- Type: text/plain, Size: 2663 bytes --]

Tests show that bdi_thresh may take minutes to ramp up on a typical
desktop.  That time can be improved, but cannot be eliminated entirely.
So when (background_thresh + dirty_thresh)/2 is reached and
balance_dirty_pages() starts to throttle the task, it will suddenly find
the (still low and ramping up) bdi_thresh exceeded _excessively_.  Here
we definitely don't want to stall the task for one minute (when it's
writing to a USB stick).  So introduce an alternative way to break out
of the loop once the bdi dirty/writeback pages have dropped by a
reasonable amount.

It will at least pause for one loop before trying to break out.

The break is designed mainly to help the single-task case.  The break
threshold is 125ms worth of written data, so that when the task has
slept for MAX_PAUSE=200ms, it will have a good chance to break out.  On
NFS there may be only 1-2 completions of large COMMITs per second, in
which case the task may still get stuck for 1s.

Note that this opens up the possibility that, during normal operation,
a huge number of slow dirtiers writing to a really slow device might
manage to outrun bdi_thresh.  But the risk is pretty low.
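
For example (illustrative numbers, assuming 4KB pages): with
write_bandwidth estimated at 80MB/s, i.e. 20480 pages/s,
write_bandwidth/8 is 2560 pages (~10MB).  The loop is left as soon as
bdi_dirty has dropped by more than that since the previous iteration,
provided nr_dirty is still below dirty_thresh.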

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
@@ -693,6 +693,7 @@ static void balance_dirty_pages(struct a
 	long nr_dirty;
 	long bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	long avg_dirty;  /* smoothed bdi_dirty */
+	long bdi_prev_dirty = 0;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -749,6 +750,24 @@ static void balance_dirty_pages(struct a
 
 		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
 
+		/*
+		 * bdi_thresh takes time to ramp up from the initial 0,
+		 * especially for slow devices.
+		 *
+		 * It's possible that at the moment dirty throttling starts,
+		 *	bdi_dirty = nr_dirty
+		 *		  = (background_thresh + dirty_thresh) / 2
+		 *		  >> bdi_thresh
+		 * Then the task could be blocked for many seconds to flush all
+		 * the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
+		 * complementary way to break out of the loop when 125ms worth
+		 * of dirty pages have been cleaned during our pause time.
+		 */
+		if (nr_dirty <= dirty_thresh &&
+		    bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
+			break;
+		bdi_prev_dirty = bdi_dirty;
+
 		avg_dirty = bdi->avg_dirty;
 		if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
 			avg_dirty = bdi_dirty;



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 18/35] writeback: start background writeback earlier
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-kick-background-early.patch --]
[-- Type: text/plain, Size: 1264 bytes --]

It's possible for someone to suddenly eat lots of memory, leading to a
sudden drop of the global dirty limit.  A dirtier task may then get
hard throttled immediately, without any previous balance_dirty_pages()
call having invoked background writeback.

In this case we need to check for background writeback earlier in the
loop, to avoid stalling the application for a very long time.  This was
not a problem before the IO-less balance_dirty_pages(), which would try
to write something itself and then break out of the loop regardless of
the global limit.

Another case this check helps with is when the dirty limit sits too
close to the background threshold, so that a task manages to jump
directly past the pause threshold (background+dirty)/2.
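
Concretely, as the diff below shows, the check is now performed inside
the throttle loop itself, right before bdi_update_bandwidth(), so a
task that enters the loop already over the limits will kick the flusher
on its very first iteration.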

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:17.000000000 +0800
@@ -748,6 +748,9 @@ static void balance_dirty_pages(struct a
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
 		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
 
 		/*



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 19/35] writeback: make nr_to_write a per-file limit
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 1883 bytes --]

This ensures a full 4MB (or larger) writeback size for large dirty
files.
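
As the diff below implements it, wbc->nr_to_write is temporarily
replaced with per_file_limit around do_writepages(), and the saved
global budget is afterwards charged only for the pages actually
written: each inode gets at most per_file_limit pages per pass, without
nr_to_write shrinking by more than what was really written out.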

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   11 +++++++++++
 include/linux/writeback.h |    1 +
 2 files changed, 12 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
@@ -330,6 +330,8 @@ static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long per_file_limit = wbc->per_file_limit;
+	long uninitialized_var(nr_to_write);
 	unsigned dirty;
 	int ret;
 
@@ -365,8 +367,16 @@ writeback_single_inode(struct inode *ino
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
+	if (per_file_limit) {
+		nr_to_write = wbc->nr_to_write;
+		wbc->nr_to_write = per_file_limit;
+	}
+
 	ret = do_writepages(mapping, wbc);
 
+	if (per_file_limit)
+		wbc->nr_to_write += nr_to_write - per_file_limit;
+
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
@@ -696,6 +706,7 @@ static long wb_writeback(struct bdi_writ
 
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
+		wbc.per_file_limit = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:17.000000000 +0800
@@ -43,6 +43,7 @@ struct writeback_control {
 					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	long per_file_limit;		/* Write this many pages for one file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 20/35] writeback: scale IO chunk size up to device bandwidth
  2010-12-13 14:46 ` Wu Fengguang
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason,
	Peter Zijlstra, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 4961 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped nr_to_write up to four times the value
sent by the VM in order to saturate medium-sized RAID arrays.  This
value was also problematic for ext4, as it caused large files to become
interleaved on disk in 8 megabyte chunks (we bumped up the nr_to_write
by a factor of two).

So remove the MAX_WRITEBACK_PAGES constraint entirely.  The writeback
chunk size will adapt to whatever the storage device can write within
1 second.

For a typical hard disk, the resulting chunk size will be 32MB or 64MB.
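
For instance (assuming 4KB pages): a measured write_bandwidth of
100MB/s is 25600 pages/s, which rounddown_pow_of_two() turns into 16384
pages, i.e. a 64MB chunk; around 50MB/s it becomes 8192 pages, i.e.
32MB.  Anything smaller is raised to the MIN_WRITEBACK_PAGES floor of
4MB.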

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   60 +++++++++++++++++++-----------------
 include/linux/writeback.h |    5 +++
 2 files changed, 38 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
@@ -600,15 +600,6 @@ static void __writeback_inodes_sb(struct
 	spin_unlock(&inode_lock);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -620,6 +611,38 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long writeback_chunk_size(struct backing_dev_info *bdi,
+					  int sync_mode)
+{
+	unsigned long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          __writeback_inodes_sb()     <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                  (quickly) tag currently dirty pages
+	 *                  (maybe slowly) sync all tagged pages
+	 */
+	if (sync_mode == WB_SYNC_ALL)
+		return LONG_MAX;
+
+	pages = bdi->write_bandwidth;
+
+	if (pages < MIN_WRITEBACK_PAGES)
+		return MIN_WRITEBACK_PAGES;
+
+	return rounddown_pow_of_two(pages);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -659,24 +682,6 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}
 
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
-		write_chunk = LONG_MAX;
-
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
 
@@ -705,6 +710,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
+		write_chunk = writeback_chunk_size(wb->bdi, wbc.sync_mode);
 		wbc.nr_to_write = write_chunk;
 		wbc.per_file_limit = write_chunk;
 		wbc.pages_skipped = 0;
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:17.000000000 +0800
@@ -22,6 +22,11 @@ extern spinlock_t inode_lock;
 #define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
 
 /*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES     (4096 >> (PAGE_CACHE_SHIFT - 10))
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 20/35] writeback: scale IO chunk size up to device bandwidth
@ 2010-12-13 14:47   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason,
	Peter Zijlstra, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 5257 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4 as well, as it caused large files
to be come interleaved on disk by in 8 megabyte chunks (we bumped up
the nr_to_write by a factor of two).

So remove the MAX_WRITEBACK_PAGES constraint totally. The writeback pages
will adapt to as large as the storage device can write within 1 second.

For a typical hard disk, the resulted chunk size will be 32MB or 64MB.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   60 +++++++++++++++++++-----------------
 include/linux/writeback.h |    5 +++
 2 files changed, 38 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
@@ -600,15 +600,6 @@ static void __writeback_inodes_sb(struct
 	spin_unlock(&inode_lock);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -620,6 +611,38 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long writeback_chunk_size(struct backing_dev_info *bdi,
+					  int sync_mode)
+{
+	unsigned long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          __writeback_inodes_sb()     <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                  (quickly) tag currently dirty pages
+	 *                  (maybe slowly) sync all tagged pages
+	 */
+	if (sync_mode == WB_SYNC_ALL)
+		return LONG_MAX;
+
+	pages = bdi->write_bandwidth;
+
+	if (pages < MIN_WRITEBACK_PAGES)
+		return MIN_WRITEBACK_PAGES;
+
+	return rounddown_pow_of_two(pages);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -659,24 +682,6 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}
 
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
-		write_chunk = LONG_MAX;
-
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
 
@@ -705,6 +710,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
+		write_chunk = writeback_chunk_size(wb->bdi, wbc.sync_mode);
 		wbc.nr_to_write = write_chunk;
 		wbc.per_file_limit = write_chunk;
 		wbc.pages_skipped = 0;
--- linux-next.orig/include/linux/writeback.h	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-12-13 21:46:17.000000000 +0800
@@ -22,6 +22,11 @@ extern spinlock_t inode_lock;
 #define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
 
 /*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES     (4096 >> (PAGE_CACHE_SHIFT - 10))
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 21/35] writeback: trace balance_dirty_pages()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 7519 bytes --]

This tracepoint is useful for analyzing the dynamics of the throttling
algorithms, and helpful for debugging user-reported problems.

Here is an interesting test to verify the theory with balance_dirty_pages()
tracing. On a partition that can do ~60MB/s, a sparse file is created and
4 rsync tasks with different write bandwidths are started:

	dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
	echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable

	rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
	rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&

Trace output within 0.1 second, grouped by task:

rsync-3824  [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130876 gap=5340 dirtied=192 pause=20

rsync-3822  [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130777 gap=5113 dirtied=192 pause=20

rsync-3821  [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129570 gap=3714 dirtied=64 pause=8
rsync-3821  [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129589 gap=3733 dirtied=64 pause=8
rsync-3821  [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129604 gap=3748 dirtied=64 pause=8
rsync-3821  [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129583 gap=3983 dirtied=64 pause=8
rsync-3821  [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129594 gap=3802 dirtied=64 pause=8
rsync-3821  [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129619 gap=3827 dirtied=64 pause=8

rsync-3823  [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128762 gap=2842 dirtied=64 pause=12
rsync-3823  [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128813 gap=3021 dirtied=64 pause=12
rsync-3823  [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128805 gap=2885 dirtied=64 pause=12
rsync-3823  [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128823 gap=2967 dirtied=64 pause=12
rsync-3823  [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128821 gap=3221 dirtied=64 pause=12
rsync-3823  [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128825 gap=3225 dirtied=64 pause=12
rsync-3823  [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128782 gap=3214 dirtied=64 pause=12
rsync-3823  [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128764 gap=3036 dirtied=64 pause=12

The data vividly show that

- the heaviest writer is throttled a bit (weight=39%)

- the lighter writers run at full speed (weight=15%,15%,30%)
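
The gap and weight percentages reported by the new tracepoint come from
the BDP_PERCENT() macro added below. A throwaway userspace sketch of that
arithmetic (illustration only; the scale argument stands in for
BDI_SOFT_DIRTY_LIMIT/TASK_SOFT_DIRTY_LIMIT, whose real values are defined
elsewhere in the series):

	/* sketch of the BDP_PERCENT() rounding, userspace only */
	#include <stdio.h>

	static long bdp_percent(long a, long b, long scale, long bdi_limit)
	{
		/* "+ bdi_limit/2" rounds to nearest, "| 1" avoids div-by-zero */
		return ((a - b) * 100 * scale + bdi_limit / 2) / (bdi_limit | 1);
	}

	int main(void)
	{
		long bdi_limit = 130000, task_limit = 128800;
		long scale = 1;	/* placeholder for TASK_SOFT_DIRTY_LIMIT */

		printf("task_weight=%ld%%\n",
		       bdp_percent(bdi_limit, task_limit, scale, bdi_limit));
		return 0;
	}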

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   87 ++++++++++++++++++++++++++++-
 mm/page-writeback.c              |   20 ++++++
 2 files changed, 104 insertions(+), 3 deletions(-)

--- linux-next.orig/include/trace/events/writeback.h	2010-12-13 21:46:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-12-13 21:46:18.000000000 +0800
@@ -147,11 +147,92 @@ DEFINE_EVENT(wbc_class, name, \
 DEFINE_WBC_EVENT(wbc_writeback_start);
 DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+#define KBps(x)			((x) << (PAGE_SHIFT - 10))
+#define BDP_PERCENT(a, b, c)	(((__entry->a) - (__entry->b)) * 100 * (c) + \
+				  __entry->bdi_limit/2) / (__entry->bdi_limit|1)
+
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 long bdi_dirty,
+		 long avg_dirty,
+		 long bdi_limit,
+		 long task_limit,
+		 long dirtied,
+		 long task_bw,
+		 long period,
+		 long pause),
+
+	TP_ARGS(bdi, bdi_dirty, avg_dirty, bdi_limit, task_limit,
+		dirtied, task_bw, period, pause),
+
+	TP_STRUCT__entry(
+		__array(char,	bdi, 32)
+		__field(long,	bdi_dirty)
+		__field(long,	avg_dirty)
+		__field(long,	bdi_limit)
+		__field(long,	task_limit)
+		__field(long,	dirtied)
+		__field(long,	bdi_bw)
+		__field(long,	base_bw)
+		__field(long,	task_bw)
+		__field(long,	period)
+		__field(long,	think)
+		__field(long,	pause)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+		__entry->bdi_dirty	= bdi_dirty;
+		__entry->avg_dirty	= avg_dirty;
+		__entry->bdi_limit	= bdi_limit;
+		__entry->task_limit	= task_limit;
+		__entry->dirtied	= dirtied;
+		__entry->bdi_bw		= KBps(bdi->write_bandwidth);
+		__entry->base_bw	= KBps(bdi->throttle_bandwidth);
+		__entry->task_bw	= KBps(task_bw);
+		__entry->think		= current->paused_when == 0 ? 0 :
+			 (long)(jiffies - current->paused_when) * 1000 / HZ;
+		__entry->period		= period * 1000 / HZ;
+		__entry->pause		= pause * 1000 / HZ;
+	),
+
+
+	/*
+	 *            [..............soft throttling range............]
+	 *            ^               |<=========== bdi_gap =========>|
+	 * (background+dirty)/2       |<== task_gap ==>|
+	 * -------------------|-------+----------------|--------------|
+	 *   (bdi_limit * 7/8)^       ^bdi_dirty       ^task_limit    ^bdi_limit
+	 *
+	 * Reasonable large gaps help produce smooth pause times.
+	 */
+	TP_printk("bdi %s: "
+		  "bdi_limit=%lu task_limit=%lu bdi_dirty=%lu avg_dirty=%lu "
+		  "bdi_gap=%ld%% task_gap=%ld%% task_weight=%ld%% "
+		  "bdi_bw=%lu base_bw=%lu task_bw=%lu "
+		  "dirtied=%lu period=%lu think=%ld pause=%ld",
+		  __entry->bdi,
+		  __entry->bdi_limit,
+		  __entry->task_limit,
+		  __entry->bdi_dirty,
+		  __entry->avg_dirty,
+		  BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
+		  BDP_PERCENT(task_limit, avg_dirty, TASK_SOFT_DIRTY_LIMIT),
+		  /* task weight: proportion of recent dirtied pages */
+		  BDP_PERCENT(bdi_limit, task_limit, TASK_SOFT_DIRTY_LIMIT),
+		  __entry->bdi_bw,	/* bdi write bandwidth */
+		  __entry->base_bw,	/* bdi base throttle bandwidth */
+		  __entry->task_bw,	/* task throttle bandwidth */
+		  __entry->dirtied,
+		  __entry->period,	/* ms */
+		  __entry->think,	/* ms */
+		  __entry->pause	/* ms */
+		  )
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:18.000000000 +0800
@@ -778,6 +778,8 @@ static void balance_dirty_pages(struct a
 		pause_max = max_pause(bdi_thresh);
 
 		if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
+			bw = 0;
+			period = 0;
 			pause = pause_max;
 			goto pause;
 		}
@@ -805,6 +807,15 @@ static void balance_dirty_pages(struct a
 		 * it may be a light dirtier.
 		 */
 		if (unlikely(-pause < HZ*10)) {
+			trace_balance_dirty_pages(bdi,
+						  bdi_dirty,
+						  avg_dirty,
+						  bdi_thresh,
+						  task_thresh,
+						  pages_dirtied,
+						  bw,
+						  period,
+						  pause);
 			if (-pause <= HZ/10)
 				current->paused_when += period;
 			else
@@ -815,6 +826,15 @@ static void balance_dirty_pages(struct a
 		pause = clamp_val(pause, 1, pause_max);
 
 pause:
+		trace_balance_dirty_pages(bdi,
+					  bdi_dirty,
+					  avg_dirty,
+					  bdi_thresh,
+					  task_thresh,
+					  pages_dirtied,
+					  bw,
+					  period,
+					  pause);
 		current->paused_when = jiffies;
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 22/35] writeback: trace global dirty page states
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-global-dirty-states.patch --]
[-- Type: text/plain, Size: 3416 bytes --]

Add the trace event balance_dirty_state to show the global dirty page
counts and thresholds at each balance_dirty_pages() loop.
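
The counter shuffling in the page-writeback.c hunk below only reorders
the reads so that each component can be handed to the tracepoint; after
the two additions the totals match the old computation. A throwaway
userspace sketch (placeholder numbers, illustration only):

	/* sketch: the split-then-recombine below preserves the old totals */
	#include <stdio.h>

	int main(void)
	{
		/* placeholder page counts, purely for illustration */
		unsigned long file_dirty = 3000, unstable_nfs = 500, writeback = 1200;

		/* old computation */
		unsigned long old_reclaimable = file_dirty + unstable_nfs;
		unsigned long old_dirty = old_reclaimable + writeback;

		/* new: read the components separately (so they can be traced),
		 * then recombine */
		unsigned long nr_reclaimable = file_dirty;
		unsigned long bdi_dirty = unstable_nfs;
		unsigned long nr_dirty = writeback;

		nr_reclaimable += bdi_dirty;
		nr_dirty += nr_reclaimable;

		printf("reclaimable %lu == %lu, dirty %lu == %lu\n",
		       old_reclaimable, nr_reclaimable, old_dirty, nr_dirty);
		return 0;
	}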

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   57 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |   15 ++++++-
 2 files changed, 69 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:18.000000000 +0800
@@ -713,12 +713,21 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY);
+		bdi_dirty = global_page_state(NR_UNSTABLE_NFS);
+		nr_dirty = global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
+		trace_balance_dirty_state(mapping,
+					  nr_reclaimable,
+					  nr_dirty,
+					  bdi_dirty,
+					  background_thresh,
+					  dirty_thresh);
+		nr_reclaimable += bdi_dirty;
+		nr_dirty += nr_reclaimable;
+
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
--- linux-next.orig/include/trace/events/writeback.h	2010-12-13 21:46:18.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-12-13 21:46:18.000000000 +0800
@@ -149,6 +149,63 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(balance_dirty_state,
+
+	TP_PROTO(struct address_space *mapping,
+		 unsigned long nr_dirty,
+		 unsigned long nr_writeback,
+		 unsigned long nr_unstable,
+		 unsigned long background_thresh,
+		 unsigned long dirty_thresh
+	),
+
+	TP_ARGS(mapping,
+		nr_dirty,
+		nr_writeback,
+		nr_unstable,
+		background_thresh,
+		dirty_thresh
+	),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(unsigned long,	ino)
+		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_writeback)
+		__field(unsigned long,	nr_unstable)
+		__field(unsigned long,	background_thresh)
+		__field(unsigned long,	dirty_thresh)
+		__field(unsigned long,	task_dirtied_pause)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi,
+			dev_name(mapping->backing_dev_info->dev), 32);
+		__entry->ino			= mapping->host->i_ino;
+		__entry->nr_dirty		= nr_dirty;
+		__entry->nr_writeback		= nr_writeback;
+		__entry->nr_unstable		= nr_unstable;
+		__entry->background_thresh	= background_thresh;
+		__entry->dirty_thresh		= dirty_thresh;
+		__entry->task_dirtied_pause	= current->nr_dirtied_pause;
+	),
+
+	TP_printk("bdi %s: dirty=%lu wb=%lu unstable=%lu "
+		  "bg_thresh=%lu thresh=%lu gap=%ld "
+		  "poll_thresh=%lu ino=%lu",
+		  __entry->bdi,
+		  __entry->nr_dirty,
+		  __entry->nr_writeback,
+		  __entry->nr_unstable,
+		  __entry->background_thresh,
+		  __entry->dirty_thresh,
+		  __entry->dirty_thresh - __entry->nr_dirty -
+		  __entry->nr_writeback - __entry->nr_unstable,
+		  __entry->task_dirtied_pause,
+		  __entry->ino
+	)
+);
+
 #define KBps(x)			((x) << (PAGE_SHIFT - 10))
 #define BDP_PERCENT(a, b, c)	(((__entry->a) - (__entry->b)) * 100 * (c) + \
 				  __entry->bdi_limit/2) / (__entry->bdi_limit|1)



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 23/35] writeback: trace writeback_single_inode()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 3303 bytes --]

It is valuable to know how the inodes are iterated and how much IO is
done on each of them.
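
The state= field in the new trace line is produced by __print_flags(),
which prints the names of the set bits joined by '|'. A userspace sketch
of that decoding (illustration only; the flag values below are
placeholders, not the real i_state bits):

	/* sketch of how show_inode_state() renders the state= field */
	#include <stdio.h>

	#define I_DIRTY_PAGES	(1 << 0)	/* placeholder bit */
	#define I_SYNC		(1 << 1)	/* placeholder bit */
	#define I_REFERENCED	(1 << 2)	/* placeholder bit */

	static void show_inode_state(unsigned long state)
	{
		static const struct { unsigned long bit; const char *name; } flags[] = {
			{ I_DIRTY_PAGES,	"I_DIRTY_PAGES" },
			{ I_SYNC,		"I_SYNC" },
			{ I_REFERENCED,		"I_REFERENCED" },
		};
		unsigned int i;
		int first = 1;

		for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
			if (state & flags[i].bit) {
				printf("%s%s", first ? "" : "|", flags[i].name);
				first = 0;
			}
		printf("\n");
	}

	int main(void)
	{
		show_inode_state(I_DIRTY_PAGES | I_SYNC);	/* I_DIRTY_PAGES|I_SYNC */
		return 0;
	}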

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   12 +++---
 include/trace/events/writeback.h |   52 +++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-12-13 21:46:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-13 21:46:19.000000000 +0800
@@ -331,7 +331,7 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	long per_file_limit = wbc->per_file_limit;
-	long uninitialized_var(nr_to_write);
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -351,7 +351,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode);
-			return 0;
+			ret = 0;
+			goto out;
 		}
 
 		/*
@@ -367,10 +368,8 @@ writeback_single_inode(struct inode *ino
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
-	if (per_file_limit) {
-		nr_to_write = wbc->nr_to_write;
+	if (per_file_limit)
 		wbc->nr_to_write = per_file_limit;
-	}
 
 	ret = do_writepages(mapping, wbc);
 
@@ -446,6 +445,9 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+out:
+	trace_writeback_single_inode(inode, wbc,
+				     nr_to_write - wbc->nr_to_write);
 	return ret;
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2010-12-13 21:46:18.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-12-13 21:46:19.000000000 +0800
@@ -10,6 +10,19 @@
 
 struct wb_writeback_work;
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+		)
+
 DECLARE_EVENT_CLASS(writeback_work_class,
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
 	TP_ARGS(bdi, work),
@@ -149,6 +162,45 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_single_inode,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long wrote
+	),
+
+	TP_ARGS(inode, wbc, wrote),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, wrote)
+		__field(long, nr_to_write)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->wrote		= wrote;
+		__entry->nr_to_write	= wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu wrote=%lu to_write=%ld",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->wrote,
+		  __entry->nr_to_write
+	)
+);
+
 TRACE_EVENT(balance_dirty_state,
 
 	TP_PROTO(struct address_space *mapping,



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 24/35] btrfs: don't call balance_dirty_pages_ratelimited() on already dirty pages
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: btrfs-fix-balance-size.patch --]
[-- Type: text/plain, Size: 3854 bytes --]

When doing 1KB sequential writes to the same 4KB page,
balance_dirty_pages_ratelimited() should be called once for that page
rather than once for each of the four writes. Failing to do so throttles
the writing tasks much too heavily.
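
[Editor's illustration, not part of the posted patch] The rule the hunks
below implement is to feed the throttle only pages that actually go from
clean to dirty: set_page_dirty() returns nonzero only when it newly
dirties a page, and clear_page_dirty_for_io() returns nonzero only when
the page was already dirty, so a zero return there marks a page this
write is about to dirty for the first time.  A minimal sketch of the
pattern (needs <linux/mm.h> and <linux/writeback.h>):

	/* Throttle on behalf of one page, but only if we really dirtied it. */
	static void dirty_page_and_balance(struct address_space *mapping,
					   struct page *page)
	{
		if (set_page_dirty(page))
			balance_dirty_pages_ratelimited_nr(mapping, 1);
	}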

CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/file.c       |   11 +++++++----
 fs/btrfs/ioctl.c      |    6 ++++--
 fs/btrfs/relocation.c |    6 ++++--
 3 files changed, 15 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/btrfs/file.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/btrfs/file.c	2010-12-13 21:46:19.000000000 +0800
@@ -762,7 +762,8 @@ out:
 static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
 			 struct page **pages, size_t num_pages,
 			 loff_t pos, unsigned long first_index,
-			 unsigned long last_index, size_t write_bytes)
+			 unsigned long last_index, size_t write_bytes,
+			 int *nr_dirtied)
 {
 	struct extent_state *cached_state = NULL;
 	int i;
@@ -825,7 +826,8 @@ again:
 				     GFP_NOFS);
 	}
 	for (i = 0; i < num_pages; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		if (!clear_page_dirty_for_io(pages[i]))
+			(*nr_dirtied)++;
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}
@@ -966,6 +968,7 @@ static ssize_t btrfs_file_aio_write(stru
 					 offset);
 		size_t num_pages = (write_bytes + PAGE_CACHE_SIZE - 1) >>
 					PAGE_CACHE_SHIFT;
+		int nr_dirtied = 0;
 
 		WARN_ON(num_pages > nrptrs);
 		memset(pages, 0, sizeof(struct page *) * nrptrs);
@@ -976,7 +979,7 @@ static ssize_t btrfs_file_aio_write(stru
 
 		ret = prepare_pages(root, file, pages, num_pages,
 				    pos, first_index, last_index,
-				    write_bytes);
+				    write_bytes, &nr_dirtied);
 		if (ret) {
 			btrfs_delalloc_release_space(inode, write_bytes);
 			goto out;
@@ -1000,7 +1003,7 @@ static ssize_t btrfs_file_aio_write(stru
 						 pos + write_bytes - 1);
 		} else {
 			balance_dirty_pages_ratelimited_nr(inode->i_mapping,
-							   num_pages);
+							   nr_dirtied);
 			if (num_pages <
 			    (root->leafsize >> PAGE_CACHE_SHIFT) + 1)
 				btrfs_btree_balance_dirty(root, 1);
--- linux-next.orig/fs/btrfs/ioctl.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/btrfs/ioctl.c	2010-12-13 21:46:19.000000000 +0800
@@ -647,6 +647,7 @@ static int btrfs_defrag_file(struct file
 	u64 skip = 0;
 	u64 defrag_end = 0;
 	unsigned long i;
+	int dirtied;
 	int ret;
 
 	if (inode->i_size == 0)
@@ -751,7 +752,7 @@ again:
 
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
 		ClearPageChecked(page);
-		set_page_dirty(page);
+		dirtied = set_page_dirty(page);
 		unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
 
 loop_unlock:
@@ -759,7 +760,8 @@ loop_unlock:
 		page_cache_release(page);
 		mutex_unlock(&inode->i_mutex);
 
-		balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
+		if (dirtied)
+			balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
 		i++;
 	}
 
--- linux-next.orig/fs/btrfs/relocation.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/btrfs/relocation.c	2010-12-13 21:46:19.000000000 +0800
@@ -2894,6 +2894,7 @@ static int relocate_file_extent_cluster(
 	struct file_ra_state *ra;
 	int nr = 0;
 	int ret = 0;
+	int dirtied;
 
 	if (!cluster->nr)
 		return 0;
@@ -2970,7 +2971,7 @@ static int relocate_file_extent_cluster(
 		}
 
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
-		set_page_dirty(page);
+		dirtied = set_page_dirty(page);
 
 		unlock_extent(&BTRFS_I(inode)->io_tree,
 			      page_start, page_end, GFP_NOFS);
@@ -2978,7 +2979,8 @@ static int relocate_file_extent_cluster(
 		page_cache_release(page);
 
 		index++;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (dirtied)
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 		btrfs_throttle(BTRFS_I(inode)->root);
 	}
 	WARN_ON(nr != cluster->nr);



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 25/35] btrfs: lower the dirty balancing rate limit
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: btrfs-limit-nr-dirtied.patch --]
[-- Type: text/plain, Size: 1018 bytes --]

Call balance_dirty_pages_ratelimited_nr() at least once every 16 dirtied
pages.

Experiments show that the larger interval used by the original code can
easily push dirty pages beyond the bdi dirty limit when running 100
concurrent dd's.
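
[Editor's note, for scale; 4KB pages and 8-byte pointers are assumed
here, the patch itself does not state them] The old cap allowed
PAGE_CACHE_SIZE / sizeof(struct page *) = 4096 / 8 = 512 pages, i.e. up
to 2MB dirtied between two balance calls per task; with 100 concurrent
dd's that is up to ~200MB of collective overshoot past the bdi limit.
The new min(..., 16UL) cap bounds each interval to 16 pages = 64KB.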

CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/file.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- linux-next.orig/fs/btrfs/file.c	2010-12-13 21:46:19.000000000 +0800
+++ linux-next/fs/btrfs/file.c	2010-12-13 21:46:20.000000000 +0800
@@ -924,9 +924,8 @@ static ssize_t btrfs_file_aio_write(stru
 	}
 
 	iov_iter_init(&i, iov, nr_segs, count, num_written);
-	nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) /
-		     PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
-		     (sizeof(struct page *)));
+	nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
+		     min(16UL, PAGE_CACHE_SIZE / (sizeof(struct page *))));
 	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
 
 	/* generic_write_checks can change our pos */



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 26/35] btrfs: wait on too many nr_async_bios
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: btrfs-nr_async_bios-wait.patch --]
[-- Type: text/plain, Size: 1433 bytes --]

Tests show that btrfs repeatedly moves _all_ PG_dirty pages into
PG_writeback state. It's desirable to put some limit on the number of
pages under writeback.
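
[Editor's illustration of the throttling pattern applied below; the
identifiers are generic, not btrfs's] Submitters sleep while too much
async IO is in flight and the completion path wakes them, so the number
of PG_writeback pages is bounded by IO completion speed instead of
swallowing every dirty page (needs <linux/wait.h> and <asm/atomic.h>):

	static atomic_t nr_in_flight;
	static DECLARE_WAIT_QUEUE_HEAD(throttle_wait);

	static void throttle_submit(int limit)
	{
		/* submission side: block while too much async IO is queued */
		if (atomic_inc_return(&nr_in_flight) > limit)
			wait_event(throttle_wait,
				   atomic_read(&nr_in_flight) < limit);
	}

	static void on_io_done(int limit)
	{
		/* completion side: drop the count and wake throttled tasks */
		if (atomic_dec_return(&nr_in_flight) < limit &&
		    waitqueue_active(&throttle_wait))
			wake_up(&throttle_wait);
	}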

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/disk-io.c |    7 +++++++
 1 file changed, 7 insertions(+)

before patch:
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png

after patch:
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png

--- linux-next.orig/fs/btrfs/disk-io.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/btrfs/disk-io.c	2010-12-13 21:46:20.000000000 +0800
@@ -590,6 +590,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
 			extent_submit_bio_hook_t *submit_bio_done)
 {
 	struct async_submit_bio *async;
+	int limit;
 
 	async = kmalloc(sizeof(*async), GFP_NOFS);
 	if (!async)
@@ -617,6 +618,12 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
 
 	btrfs_queue_worker(&fs_info->workers, &async->work);
 
+	limit = btrfs_async_submit_limit(fs_info);
+
+	if (atomic_read(&fs_info->nr_async_bios) > limit)
+		wait_event(fs_info->async_submit_wait,
+			   (atomic_read(&fs_info->nr_async_bios) < limit));
+
 	while (atomic_read(&fs_info->async_submit_draining) &&
 	      atomic_read(&fs_info->nr_async_submits)) {
 		wait_event(fs_info->async_submit_wait,



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 27/35] nfs: livelock prevention is now done in VFS
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: nfs-revert-livelock-72cb77f4a5ac.patch --]
[-- Type: text/plain, Size: 2762 bytes --]

This reverts commit 72cb77f4a5 ("NFS: Throttle page dirtying while we're
flushing to disk"). The two problems it tried to address

- sync livelock
- out of order writes

are now both handled in the VFS:

- PAGECACHE_TAG_TOWRITE prevents the sync livelock (see the sketch below)
- IO-less balance_dirty_pages() avoids concurrent writes
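
[Editor's sketch of the first point, condensed from mainline
mm/page-writeback.c:write_cache_pages()] A data-integrity sync first
tags the dirty pages it is responsible for, then writes only tagged
pages, so pages dirtied after the sync started can no longer livelock it:

	int tag = (wbc->sync_mode == WB_SYNC_ALL) ? PAGECACHE_TAG_TOWRITE
						  : PAGECACHE_TAG_DIRTY;

	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);

	/*
	 * The page lookup loop then only visits pages carrying `tag',
	 * ignoring pages dirtied after the sync began.
	 */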

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/file.c          |    9 ---------
 fs/nfs/write.c         |   11 -----------
 include/linux/nfs_fs.h |    1 -
 3 files changed, 21 deletions(-)

--- linux-next.orig/fs/nfs/file.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/nfs/file.c	2010-12-13 21:46:20.000000000 +0800
@@ -392,15 +392,6 @@ static int nfs_write_begin(struct file *
 			   IOMODE_RW);
 
 start:
-	/*
-	 * Prevent starvation issues if someone is doing a consistency
-	 * sync-to-disk
-	 */
-	ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
-	if (ret)
-		return ret;
-
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:20.000000000 +0800
@@ -337,26 +337,15 @@ static int nfs_writepages_callback(struc
 int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	unsigned long *bitlock = &NFS_I(inode)->flags;
 	struct nfs_pageio_descriptor pgio;
 	int err;
 
-	/* Stop dirtying of new pages while we sync */
-	err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
-	if (err)
-		goto out_err;
-
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
 
 	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
 	err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
 	nfs_pageio_complete(&pgio);
 
-	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
-	smp_mb__after_clear_bit();
-	wake_up_bit(bitlock, NFS_INO_FLUSHING);
-
 	if (err < 0)
 		goto out_err;
 	err = pgio.pg_error;
--- linux-next.orig/include/linux/nfs_fs.h	2010-12-13 21:45:55.000000000 +0800
+++ linux-next/include/linux/nfs_fs.h	2010-12-13 21:46:20.000000000 +0800
@@ -216,7 +216,6 @@ struct nfs_inode {
 #define NFS_INO_STALE		(1)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
 #define NFS_INO_MOUNTPOINT	(3)		/* inode is remote mountpoint */
-#define NFS_INO_FLUSHING	(4)		/* inode is flushing out data */
 #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
 #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */
 #define NFS_INO_COMMIT		(7)		/* inode is committing unstable writes */



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 28/35] nfs: writeback pages wait queue
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Jens Axboe, Chris Mason, Peter Zijlstra,
	Trond Myklebust, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 5854 bytes --]

The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request queues.

Introduce the missing writeback wait queue for NFS; otherwise its pages
under writeback can grow out of control, soaking up all the PG_dirty
pages.
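
[Editor's note, a worked example of the thresholds implemented below;
nfs_congestion_kb is tuned at runtime, the value here is illustrative
only] With nfs_congestion_kb = 65536 and 4KB pages: limit = 16384 pages
(64MB) and NFS_WAIT_PAGES = 256 pages (1MB).  Async writers block once
more than 16384 pages are under writeback and are woken below
16384 - 256 = 16128; sync writers block only above 2 * 16384 = 32768 and
are woken below 32512, so sync writes are never blocked behind async
ones.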

CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c           |    2 
 fs/nfs/write.c            |   93 +++++++++++++++++++++++++++++++-----
 include/linux/nfs_fs_sb.h |    1 
 3 files changed, 85 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:20.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
@@ -185,11 +185,68 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
 
+#define NFS_WAIT_PAGES	(1024L >> (PAGE_SHIFT - 10))
 int nfs_congestion_kb;
 
-#define NFS_CONGESTION_ON_THRESH 	(nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH	\
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -200,11 +257,8 @@ static int nfs_set_page_writeback(struct
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		page_cache_get(page);
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -216,8 +270,10 @@ static void nfs_end_page_writeback(struc
 
 	end_page_writeback(page);
 	page_cache_release(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +374,34 @@ static int nfs_writepage_locked(struct p
 
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+				   struct writeback_control *wbc, void *data)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
--- linux-next.orig/include/linux/nfs_fs_sb.h	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h	2010-12-13 21:46:21.000000000 +0800
@@ -106,6 +106,7 @@ struct nfs_server {
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
+	wait_queue_head_t	writeback_wait[2];
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux-next.orig/fs/nfs/client.c	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/fs/nfs/client.c	2010-12-13 21:46:21.000000000 +0800
@@ -1006,6 +1006,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->master_link);
 
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 28/35] nfs: writeback pages wait queue
@ 2010-12-13 14:47   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Jens Axboe, Chris Mason, Peter Zijlstra,
	Trond Myklebust, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 6150 bytes --]

The generic writeback routines are departing from congestion_wait()
in preference of get_request_wait(), aka. waiting on the block queues.

Introduce the missing writeback wait queue for NFS, otherwise its
writeback pages will grow out of control, exhausting all PG_dirty pages.

CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c           |    2 
 fs/nfs/write.c            |   93 +++++++++++++++++++++++++++++++-----
 include/linux/nfs_fs_sb.h |    1 
 3 files changed, 85 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:20.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
@@ -185,11 +185,68 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
 
+#define NFS_WAIT_PAGES	(1024L >> (PAGE_SHIFT - 10))
 int nfs_congestion_kb;
 
-#define NFS_CONGESTION_ON_THRESH 	(nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH	\
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -200,11 +257,8 @@ static int nfs_set_page_writeback(struct
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		page_cache_get(page);
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -216,8 +270,10 @@ static void nfs_end_page_writeback(struc
 
 	end_page_writeback(page);
 	page_cache_release(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +374,34 @@ static int nfs_writepage_locked(struct p
 
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+				   struct writeback_control *wbc, void *data)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
--- linux-next.orig/include/linux/nfs_fs_sb.h	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h	2010-12-13 21:46:21.000000000 +0800
@@ -106,6 +106,7 @@ struct nfs_server {
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
+	wait_queue_head_t	writeback_wait[2];
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux-next.orig/fs/nfs/client.c	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/fs/nfs/client.c	2010-12-13 21:46:21.000000000 +0800
@@ -1006,6 +1006,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->master_link);
 
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 28/35] nfs: writeback pages wait queue
@ 2010-12-13 14:47   ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Jens Axboe, Chris Mason, Peter Zijlstra,
	Trond Myklebust, Wu Fengguang, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 6150 bytes --]

The generic writeback routines are departing from congestion_wait()
in preference of get_request_wait(), aka. waiting on the block queues.

Introduce the missing writeback wait queue for NFS, otherwise its
writeback pages will grow out of control, exhausting all PG_dirty pages.

CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c           |    2 
 fs/nfs/write.c            |   93 +++++++++++++++++++++++++++++++-----
 include/linux/nfs_fs_sb.h |    1 
 3 files changed, 85 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:20.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
@@ -185,11 +185,68 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
 
+#define NFS_WAIT_PAGES	(1024L >> (PAGE_SHIFT - 10))
 int nfs_congestion_kb;
 
-#define NFS_CONGESTION_ON_THRESH 	(nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH	\
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -200,11 +257,8 @@ static int nfs_set_page_writeback(struct
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		page_cache_get(page);
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -216,8 +270,10 @@ static void nfs_end_page_writeback(struc
 
 	end_page_writeback(page);
 	page_cache_release(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +374,34 @@ static int nfs_writepage_locked(struct p
 
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+				   struct writeback_control *wbc, void *data)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
--- linux-next.orig/include/linux/nfs_fs_sb.h	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h	2010-12-13 21:46:21.000000000 +0800
@@ -106,6 +106,7 @@ struct nfs_server {
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
+	wait_queue_head_t	writeback_wait[2];
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux-next.orig/fs/nfs/client.c	2010-12-13 21:45:54.000000000 +0800
+++ linux-next/fs/nfs/client.c	2010-12-13 21:46:21.000000000 +0800
@@ -1006,6 +1006,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->master_link);
 
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {


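For illustration, the block/wakeup points implemented by nfs_set_congested()
and nfs_wakeup_congested() above work out as follows, assuming 4KB pages and
nfs_congestion_kb = 65536 (an assumed value; the real one is computed in
nfs_init_writepagecache()).  A minimal user-space sketch of the same
arithmetic, not part of the patch:

#include <stdio.h>

#define PAGE_SHIFT      12                      /* assumption: 4KB pages */
#define NFS_WAIT_PAGES  (1024L >> (PAGE_SHIFT - 10))

int main(void)
{
        long nfs_congestion_kb = 65536;         /* assumed value */
        long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
        long gap = NFS_WAIT_PAGES < limit / 8 ? NFS_WAIT_PAGES : limit / 8;

        /* ASYNC writers block above limit, wake below limit - gap */
        printf("ASYNC: block above %ld pages, wake below %ld pages\n",
               limit, limit - gap);
        /* SYNC writers block above 2*limit, wake below 2*limit - gap */
        printf("SYNC:  block above %ld pages, wake below %ld pages\n",
               2 * limit, 2 * limit - gap);
        return 0;
}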

^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 29/35] nfs: in-commit pages accounting and wait queue
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-in-commit.patch --]
[-- Type: text/plain, Size: 5850 bytes --]

When doing 10+ concurrent dd's, I observed very bumpy commit submission
(partly because the dd's are started at the same time, and hence reach
4MB of to-commit pages at the same time). Basically we rely on the server
to complete and return write/commit requests, and want both to progress
smoothly without consuming too many pages. The write request wait queue is
not enough, as it is mainly network bound. So add another wait queue for
commit requests. Only async writes need to sleep on this queue.

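For illustration, assuming 4KB pages and nfs_congestion_kb = 65536 (both
assumed values), the commit wait queue blocks and wakes at the points shown
below.  A minimal user-space sketch of the arithmetic used by
nfs_commit_wait()/nfs_commit_wakeup() in the patch, not part of the patch
itself:

#include <stdio.h>

#define PAGE_SHIFT      12              /* assumption: 4KB pages */

int main(void)
{
        long nfs_congestion_kb = 65536; /* assumed value */
        long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);

        /* async commit callers sleep while in_commit >= limit ... */
        printf("block when in_commit reaches %ld pages (%ld MB)\n",
               limit, limit >> (20 - PAGE_SHIFT));
        /* ... and are woken once in_commit drops below limit - limit/8 */
        printf("wake when in_commit drops below %ld pages (%ld MB)\n",
               limit - limit / 8, (limit - limit / 8) >> (20 - PAGE_SHIFT));
        return 0;
}
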
cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c           |    1 
 fs/nfs/write.c            |   51 ++++++++++++++++++++++++++++++------
 include/linux/nfs_fs_sb.h |    2 +
 3 files changed, 46 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
@@ -516,7 +516,7 @@ nfs_mark_request_commit(struct nfs_page 
 }
 
 static int
-nfs_clear_request_commit(struct nfs_page *req)
+nfs_clear_request_commit(struct inode *inode, struct nfs_page *req)
 {
 	struct page *page = req->wb_page;
 
@@ -554,7 +554,7 @@ nfs_mark_request_commit(struct nfs_page 
 }
 
 static inline int
-nfs_clear_request_commit(struct nfs_page *req)
+nfs_clear_request_commit(struct inode *inode, struct nfs_page *req)
 {
 	return 0;
 }
@@ -599,8 +599,10 @@ nfs_scan_commit(struct inode *inode, str
 		return 0;
 
 	ret = nfs_scan_list(nfsi, dst, idx_start, npages, NFS_PAGE_TAG_COMMIT);
-	if (ret > 0)
+	if (ret > 0) {
 		nfsi->ncommit -= ret;
+		atomic_long_add(ret, &NFS_SERVER(inode)->in_commit);
+	}
 	if (nfs_need_commit(NFS_I(inode)))
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	return ret;
@@ -668,7 +670,7 @@ static struct nfs_page *nfs_try_to_updat
 		spin_lock(&inode->i_lock);
 	}
 
-	if (nfs_clear_request_commit(req) &&
+	if (nfs_clear_request_commit(inode, req) &&
 			radix_tree_tag_clear(&NFS_I(inode)->nfs_page_tree,
 				req->wb_index, NFS_PAGE_TAG_COMMIT) != NULL)
 		NFS_I(inode)->ncommit--;
@@ -1271,6 +1273,34 @@ int nfs_writeback_done(struct rpc_task *
 
 
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
+static void nfs_commit_wait(struct nfs_server *nfss)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+	DEFINE_WAIT(wait);
+
+	if (atomic_long_read(&nfss->in_commit) < limit)
+		return;
+
+	for (;;) {
+		prepare_to_wait(&nfss->in_commit_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		if (atomic_long_read(&nfss->in_commit) < limit)
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&nfss->in_commit_wait, &wait);
+}
+
+static void nfs_commit_wakeup(struct nfs_server *nfss)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (atomic_long_read(&nfss->in_commit) < limit - limit / 8 &&
+	    waitqueue_active(&nfss->in_commit_wait))
+		wake_up(&nfss->in_commit_wait);
+}
+
 static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
 {
 	if (!test_and_set_bit(NFS_INO_COMMIT, &nfsi->flags))
@@ -1376,6 +1406,7 @@ nfs_commit_list(struct inode *inode, str
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		atomic_long_dec(&NFS_SERVER(inode)->in_commit);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
@@ -1409,7 +1440,8 @@ static void nfs_commit_release(void *cal
 	while (!list_empty(&data->pages)) {
 		req = nfs_list_entry(data->pages.next);
 		nfs_list_remove_request(req);
-		nfs_clear_request_commit(req);
+		nfs_clear_request_commit(data->inode, req);
+		atomic_long_dec(&NFS_SERVER(data->inode)->in_commit);
 
 		dprintk("NFS:       commit (%s/%lld %d@%lld)",
 			req->wb_context->path.dentry->d_inode->i_sb->s_id,
@@ -1438,6 +1470,7 @@ static void nfs_commit_release(void *cal
 		nfs_clear_page_tag_locked(req);
 	}
 	nfs_commit_clear_lock(NFS_I(data->inode));
+	nfs_commit_wakeup(NFS_SERVER(data->inode));
 	nfs_commitdata_release(calldata);
 }
 
@@ -1452,11 +1485,13 @@ static const struct rpc_call_ops nfs_com
 int nfs_commit_inode(struct inode *inode, int how)
 {
 	LIST_HEAD(head);
-	int may_wait = how & FLUSH_SYNC;
+	int sync = how & FLUSH_SYNC;
 	int res = 0;
 
-	if (!nfs_commit_set_lock(NFS_I(inode), may_wait))
+	if (!nfs_commit_set_lock(NFS_I(inode), sync))
 		goto out_mark_dirty;
+	if (!sync)
+		nfs_commit_wait(NFS_SERVER(inode));
 	spin_lock(&inode->i_lock);
 	res = nfs_scan_commit(inode, &head, 0, 0);
 	spin_unlock(&inode->i_lock);
@@ -1464,7 +1499,7 @@ int nfs_commit_inode(struct inode *inode
 		int error = nfs_commit_list(inode, &head, how);
 		if (error < 0)
 			return error;
-		if (may_wait)
+		if (sync)
 			wait_on_bit(&NFS_I(inode)->flags, NFS_INO_COMMIT,
 					nfs_wait_bit_killable,
 					TASK_KILLABLE);
--- linux-next.orig/include/linux/nfs_fs_sb.h	2010-12-13 21:46:21.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h	2010-12-13 21:46:21.000000000 +0800
@@ -107,6 +107,8 @@ struct nfs_server {
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
 	wait_queue_head_t	writeback_wait[2];
+	atomic_long_t		in_commit;	/* number of in-commit pages */
+	wait_queue_head_t	in_commit_wait;
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux-next.orig/fs/nfs/client.c	2010-12-13 21:46:21.000000000 +0800
+++ linux-next/fs/nfs/client.c	2010-12-13 21:46:21.000000000 +0800
@@ -1008,6 +1008,7 @@ static struct nfs_server *nfs_alloc_serv
 	atomic_set(&server->active, 0);
 	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
+	init_waitqueue_head(&server->in_commit_wait);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 30/35] nfs: heuristics to avoid commit
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-should-commit.patch --]
[-- Type: text/plain, Size: 1903 bytes --]

The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
unnecessary COMMIT calls") do not work well for large inodes being
actively written to.

Refine the criteria: commit when any of the following holds (a worked
example follows the list)
- the inode has gone quiet (all data transferred to the server)
- it has accumulated >= 4MB of data to commit (so it will be a large IO)
- there are too few active commits (hence too little active IO) on the server

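A minimal sketch of the three criteria, not part of the patch, assuming
MIN_WRITEBACK_PAGES corresponds to 4MB (1024 pages with 4KB pages) as the
changelog states:

#include <stdbool.h>
#include <stdio.h>

#define MIN_WRITEBACK_PAGES     1024    /* assumption: 4MB of 4KB pages */

static bool should_commit(unsigned long npages, unsigned long to_commit,
                          unsigned long in_commit)
{
        if (to_commit == npages)                /* the inode has gone quiet */
                return true;
        if (to_commit >= MIN_WRITEBACK_PAGES)   /* big enough for a large IO */
                return true;
        if (to_commit > in_commit / 2)          /* too few active commits */
                return true;
        return false;
}

int main(void)
{
        /* hypothetical: 2000 pages in the inode, 1200 to commit, 100 in flight */
        printf("%d\n", should_commit(2000, 1200, 100)); /* 1: >= 4MB to commit */
        /* hypothetical: 2000 pages, 300 to commit, 1000 already in flight */
        printf("%d\n", should_commit(2000, 300, 1000)); /* 0: keep batching */
        return 0;
}
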
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |   31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
@@ -1518,17 +1518,38 @@ out_mark_dirty:
 	return res;
 }
 
-static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_control *wbc)
+static bool nfs_should_commit(struct inode *inode,
+			      struct writeback_control *wbc)
 {
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	struct nfs_inode *nfsi = NFS_I(inode);
+	unsigned long npages = nfsi->npages;
+	unsigned long to_commit = nfsi->ncommit;
+	unsigned long in_commit = atomic_long_read(&nfss->in_commit);
+
+	/* no more active writes */
+	if (to_commit == npages)
+		return true;
+
+	/* big enough */
+	if (to_commit >= MIN_WRITEBACK_PAGES)
+		return true;
+
+	/* active commits drop low: kick more IO for the server disk */
+	if (to_commit > in_commit / 2)
+		return true;
+
+	return false;
+}
+
+static int nfs_commit_unstable_pages(struct inode *inode,
+				     struct writeback_control *wbc)
+{
 	int flags = FLUSH_SYNC;
 	int ret = 0;
 
 	if (wbc->sync_mode == WB_SYNC_NONE) {
-		/* Don't commit yet if this is a non-blocking flush and there
-		 * are a lot of outstanding writes for this mapping.
-		 */
-		if (nfsi->ncommit <= (nfsi->npages >> 1))
+		if (!nfs_should_commit(inode, wbc))
 			goto out_mark_dirty;
 
 		/* don't wait for the COMMIT response */



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 31/35] nfs: dont change wbc->nr_to_write in write_inode()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-nfs-commit-remove-nr_to_write.patch --]
[-- Type: text/plain, Size: 826 bytes --]

The nr_to_write adjustment was introduced in commit 420e3646 ("NFS: Reduce
the number of unnecessary COMMIT calls") and no longer seems necessary.

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
@@ -1557,15 +1557,8 @@ static int nfs_commit_unstable_pages(str
 	}
 
 	ret = nfs_commit_inode(inode, flags);
-	if (ret >= 0) {
-		if (wbc->sync_mode == WB_SYNC_NONE) {
-			if (ret < wbc->nr_to_write)
-				wbc->nr_to_write -= ret;
-			else
-				wbc->nr_to_write = 0;
-		}
+	if (ret >= 0)
 		return 0;
-	}
 out_mark_dirty:
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	return ret;



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 32/35] nfs: limit the range of commits
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: nfs-commit-range.patch --]
[-- Type: text/plain, Size: 2674 bytes --]

Hopefully this will help limit the number of unstable pages to be synced
at one time, return commit requests more promptly, and reduce dirty
throttling fluctuations.

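A minimal sketch, not part of the patch, of how the commit range is derived
from the requests scanned off the commit list (the values are page indices,
as used by nfs_commit_inode()/nfs_commit_rpcsetup() below):

#include <stdio.h>

int main(void)
{
        /* hypothetical: the scanned requests cover pages 100..299 of the file */
        unsigned long first_index = 100;
        unsigned long last_index  = 299;

        unsigned long offset = first_index;
        unsigned long count  = last_index - first_index + 1;

        /* before this patch the commit always covered the whole inode (0, 0) */
        printf("commit offset=%lu count=%lu\n", offset, count);
        return 0;
}
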
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
@@ -1333,7 +1333,7 @@ static void nfs_commitdata_release(void 
  */
 static int nfs_commit_rpcsetup(struct list_head *head,
 		struct nfs_write_data *data,
-		int how)
+		int how, pgoff_t offset, pgoff_t count)
 {
 	struct nfs_page *first = nfs_list_entry(head->next);
 	struct inode *inode = first->wb_context->path.dentry->d_inode;
@@ -1365,8 +1365,8 @@ static int nfs_commit_rpcsetup(struct li
 
 	data->args.fh     = NFS_FH(data->inode);
 	/* Note: we always request a commit of the entire inode */
-	data->args.offset = 0;
-	data->args.count  = 0;
+	data->args.offset = offset;
+	data->args.count  = count;
 	data->args.context = get_nfs_open_context(first->wb_context);
 	data->res.count   = 0;
 	data->res.fattr   = &data->fattr;
@@ -1389,7 +1389,8 @@ static int nfs_commit_rpcsetup(struct li
  * Commit dirty pages
  */
 static int
-nfs_commit_list(struct inode *inode, struct list_head *head, int how)
+nfs_commit_list(struct inode *inode, struct list_head *head, int how,
+		pgoff_t offset, pgoff_t count)
 {
 	struct nfs_write_data	*data;
 	struct nfs_page         *req;
@@ -1400,7 +1401,7 @@ nfs_commit_list(struct inode *inode, str
 		goto out_bad;
 
 	/* Set up the argument struct */
-	return nfs_commit_rpcsetup(head, data, how);
+	return nfs_commit_rpcsetup(head, data, how, offset, count);
  out_bad:
 	while (!list_empty(head)) {
 		req = nfs_list_entry(head->next);
@@ -1485,6 +1486,8 @@ static const struct rpc_call_ops nfs_com
 int nfs_commit_inode(struct inode *inode, int how)
 {
 	LIST_HEAD(head);
+	pgoff_t first_index;
+	pgoff_t last_index;
 	int sync = how & FLUSH_SYNC;
 	int res = 0;
 
@@ -1494,9 +1497,14 @@ int nfs_commit_inode(struct inode *inode
 		nfs_commit_wait(NFS_SERVER(inode));
 	spin_lock(&inode->i_lock);
 	res = nfs_scan_commit(inode, &head, 0, 0);
+	if (res) {
+		first_index = nfs_list_entry(head.next)->wb_index;
+		last_index  = nfs_list_entry(head.prev)->wb_index;
+	}
 	spin_unlock(&inode->i_lock);
 	if (res) {
-		int error = nfs_commit_list(inode, &head, how);
+		int error = nfs_commit_list(inode, &head, how, first_index,
+					    last_index - first_index + 1);
 		if (error < 0)
 			return error;
 		if (sync)



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 33/35] nfs: adapt congestion threshold to dirty threshold
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: nfs-congestion-thresh.patch --]
[-- Type: text/plain, Size: 1577 bytes --]

nfs_congestion_kb controls the max allowed number of writeback and
in-commit pages. It's not reasonable for them to outnumber the dirty and
to-commit pages, so each of them should take no more than 1/4 of the
dirty threshold.

Considering that nfs_init_writepagecache() is called at fresh boot, when
dirty_thresh is much higher than the real dirty limit seen after user
space has consumed lots of memory, use 1/8 instead.

We might update nfs_congestion_kb when the global dirty limit is changed
at runtime, but keep it simple for now.

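A minimal sketch, not part of the patch, of the capping arithmetic added to
nfs_init_writepagecache() below, with assumed numbers (4KB pages, a global
dirty threshold of 200000 pages, and the existing 256MB cap already applied):

#include <stdio.h>

#define PAGE_SHIFT      12                      /* assumption: 4KB pages */

int main(void)
{
        unsigned long dirty_thresh = 200000;            /* pages; assumed value */
        unsigned long nfs_congestion_kb = 256 * 1024;   /* existing 256MB cap */

        dirty_thresh <<= PAGE_SHIFT - 10;       /* pages -> KB: 800000 KB */

        if (nfs_congestion_kb > dirty_thresh / 8)
                nfs_congestion_kb = dirty_thresh / 8;

        /* 800000 / 8 = 100000 KB, i.e. ~98MB instead of 256MB */
        printf("nfs_congestion_kb = %lu KB\n", nfs_congestion_kb);
        return 0;
}
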
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
@@ -1698,6 +1698,9 @@ out:
 
 int __init nfs_init_writepagecache(void)
 {
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+
 	nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
 					     sizeof(struct nfs_write_data),
 					     0, SLAB_HWCACHE_ALIGN,
@@ -1735,6 +1738,16 @@ int __init nfs_init_writepagecache(void)
 	if (nfs_congestion_kb > 256*1024)
 		nfs_congestion_kb = 256*1024;
 
+	/*
+	 * Limit to 1/8 dirty threshold, so that writeback+in_commit pages
+	 * won't overnumber dirty+to_commit pages.
+	 */
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	dirty_thresh <<= PAGE_SHIFT - 10;
+
+	if (nfs_congestion_kb > dirty_thresh / 8)
+		nfs_congestion_kb = dirty_thresh / 8;
+
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 34/35] nfs: trace nfs_commit_unstable_pages()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: nfs-trace-write_inode.patch --]
[-- Type: text/plain, Size: 2490 bytes --]

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c             |   10 ++++--
 include/trace/events/nfs.h |   58 +++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 2 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:23.000000000 +0800
@@ -29,6 +29,9 @@
 #include "nfs4_fs.h"
 #include "fscache.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/nfs.h>
+
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
 #define MIN_POOL_WRITE		(32)
@@ -1566,10 +1569,13 @@ static int nfs_commit_unstable_pages(str
 
 	ret = nfs_commit_inode(inode, flags);
 	if (ret >= 0)
-		return 0;
+		goto out;
+
 out_mark_dirty:
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-	return ret;
+out:
+	trace_nfs_commit_unstable_pages(inode, wbc, flags, ret);
+	return ret >= 0 ? 0 : ret;
 }
 #else
 static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_control *wbc)
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/nfs.h	2010-12-13 21:46:23.000000000 +0800
@@ -0,0 +1,58 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nfs
+
+#if !defined(_TRACE_NFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NFS_H
+
+#include <linux/nfs_fs.h>
+
+
+TRACE_EVENT(nfs_commit_unstable_pages,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 int sync,
+		 int ret
+	),
+
+	TP_ARGS(inode, wbc, sync, ret),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long,	ino)
+		__field(unsigned long,	npages)
+		__field(unsigned long,	in_commit)
+		__field(unsigned long,	write_chunk)
+		__field(int,		sync)
+		__field(int,		ret)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->npages		= NFS_I(inode)->npages;
+		__entry->in_commit	=
+			atomic_long_read(&NFS_SERVER(inode)->in_commit);
+		__entry->write_chunk	= wbc->per_file_limit;
+		__entry->sync		= sync;
+		__entry->ret		= ret;
+	),
+
+	TP_printk("bdi %s: ino=%lu npages=%ld "
+		  "incommit=%lu write_chunk=%lu sync=%d ret=%d",
+		  __entry->name,
+		  __entry->ino,
+		  __entry->npages,
+		  __entry->in_commit,
+		  __entry->write_chunk,
+		  __entry->sync,
+		  __entry->ret
+	)
+);
+
+
+#endif /* _TRACE_NFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>



^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH 35/35] nfs: trace nfs_commit_release()
  2010-12-13 14:46 ` Wu Fengguang
  (?)
@ 2010-12-13 14:47   ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-13 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: trace-nfs-commit-release.patch --]
[-- Type: text/plain, Size: 1527 bytes --]


Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c             |    3 +++
 include/trace/events/nfs.h |   31 +++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

--- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:23.000000000 +0800
+++ linux-next/fs/nfs/write.c	2010-12-13 21:46:23.000000000 +0800
@@ -1475,6 +1475,9 @@ static void nfs_commit_release(void *cal
 	}
 	nfs_commit_clear_lock(NFS_I(data->inode));
 	nfs_commit_wakeup(NFS_SERVER(data->inode));
+	trace_nfs_commit_release(data->inode,
+				 data->args.offset,
+				 data->args.count);
 	nfs_commitdata_release(calldata);
 }
 
--- linux-next.orig/include/trace/events/nfs.h	2010-12-13 21:46:23.000000000 +0800
+++ linux-next/include/trace/events/nfs.h	2010-12-13 21:46:23.000000000 +0800
@@ -51,6 +51,37 @@ TRACE_EVENT(nfs_commit_unstable_pages,
 	)
 );
 
+TRACE_EVENT(nfs_commit_release,
+
+	TP_PROTO(struct inode *inode,
+		 unsigned long offset,
+		 unsigned long len),
+
+	TP_ARGS(inode, offset, len),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long,	ino)
+		__field(unsigned long,	offset)
+		__field(unsigned long,	len)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->offset		= offset;
+		__entry->len		= len;
+	),
+
+	TP_printk("bdi %s: ino=%lu offset=%lu len=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  __entry->offset,
+		  __entry->len
+	)
+);
+
 
 #endif /* _TRACE_NFS_H */
 



^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-13 14:47   ` Wu Fengguang
  (?)
  (?)
@ 2010-12-13 18:23   ` Valdis.Kletnieks
  2010-12-14  6:51       ` Wu Fengguang
  -1 siblings, 1 reply; 202+ messages in thread
From: Valdis.Kletnieks @ 2010-12-13 18:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 1102 bytes --]

On Mon, 13 Dec 2010 22:47:02 +0800, Wu Fengguang said:
> Target for >60ms pause time when there are 100+ heavy dirtiers per bdi.
> (will average around 100ms given 200ms max pause time)

> --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> @@ -659,6 +659,27 @@ static unsigned long max_pause(unsigned 
>  }
>  
>  /*
> + * Scale up pause time for concurrent dirtiers in order to reduce CPU overheads.
> + * But ensure reasonably large [min_pause, max_pause] range size, so that
> + * nr_dirtied_pause (and hence future pause time) can stay reasonably stable.
> + */
> +static unsigned long min_pause(struct backing_dev_info *bdi,
> +			       unsigned long max)
> +{
> +	unsigned long hi = ilog2(bdi->write_bandwidth);
> +	unsigned long lo = ilog2(bdi->throttle_bandwidth);
> +	unsigned long t;
> +
> +	if (lo >= hi)
> +		return 1;
> +
> +	/* (N * 10ms) on 2^N concurrent tasks */
> +	t = (hi - lo) * (10 * HZ) / 1024;

Either I need more caffeine, or the comment doesn't match the code
if HZ != 1000?

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 30/35] nfs: heuristics to avoid commit
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-13 20:53     ` Trond Myklebust
  -1 siblings, 0 replies; 202+ messages in thread
From: Trond Myklebust @ 2010-12-13 20:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> plain text document attachment (writeback-nfs-should-commit.patch)
> The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
> unnecessary COMMIT calls") do not work well for large inodes being
> actively written to.
> 
> Refine the criterion to
> - it has gone quiet (all data transfered to server)
> - has accumulated >= 4MB data to commit (so it will be large IO)
> - too few active commits (hence active IO) in the server

Where does the number 4MB come from? If I'm writing a 4GB file, I
certainly do not want to commit every 4MB; that would make for a total
of 1000 commit requests in addition to the writes. On a 64-bit client
+server both having loads of memory and connected by a decently fast
network, that can be a significant slowdown...

Most of the time, we really want the server to be managing its dirty
cache entirely independently of the client. The latter should only be
sending the commit when it really needs to free up those pages.

Cheers
  Trond


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 31/35] nfs: dont change wbc->nr_to_write in write_inode()
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-13 21:01     ` Trond Myklebust
  -1 siblings, 0 replies; 202+ messages in thread
From: Trond Myklebust @ 2010-12-13 21:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> plain text document attachment
> (writeback-nfs-commit-remove-nr_to_write.patch)
> It's introduced in commit 420e3646 ("NFS: Reduce the number of
> unnecessary COMMIT calls") and seems not necessary.
> 
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/nfs/write.c |    9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
> +++ linux-next/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
> @@ -1557,15 +1557,8 @@ static int nfs_commit_unstable_pages(str
>  	}
>  
>  	ret = nfs_commit_inode(inode, flags);
> -	if (ret >= 0) {
> -		if (wbc->sync_mode == WB_SYNC_NONE) {
> -			if (ret < wbc->nr_to_write)
> -				wbc->nr_to_write -= ret;
> -			else
> -				wbc->nr_to_write = 0;
> -		}
> +	if (ret >= 0)
>  		return 0;
> -	}
>  out_mark_dirty:
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>  	return ret;

It is there in order to tell the VM that it has succeeded in freeing up
a certain number of pages. Otherwise, we end up cycling forever in
writeback_sb_inodes() & friends with the latter not realising that they
have made progress.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 29/35] nfs: in-commit pages accounting and wait queue
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-13 21:15     ` Trond Myklebust
  -1 siblings, 0 replies; 202+ messages in thread
From: Trond Myklebust @ 2010-12-13 21:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> plain text document attachment (writeback-nfs-in-commit.patch)
> When doing 10+ concurrent dd's, I observed very bumpy commits submission
> (partly because the dd's are started at the same time, and hence reached
> 4MB to-commit pages at the same time). Basically we rely on the server
> to complete and return write/commit requests, and want both to progress
> smoothly and not consume too many pages. The write request wait queue is
> not enough as it's mainly network bounded. So add another commit request
> wait queue. Only async writes need to sleep on this queue.
> 

I'm not understanding the above reasoning. Why should we serialise
commits at the per-filesystem level (and only for non-blocking flushes
at that)?

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers
  2010-12-13 14:46   ` Wu Fengguang
@ 2010-12-14  1:21     ` Yan, Zheng
  -1 siblings, 0 replies; 202+ messages in thread
From: Yan, Zheng @ 2010-12-14  1:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Mon, Dec 13, 2010 at 10:46 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> This will noticeably reduce the fluctuaions of pause time when there are
> 100+ concurrent dirtiers.
>
> The more parallel dirtiers (1 dirtier => 4 dirtiers), the smaller
> bandwidth each dirtier will share (bdi_bandwidth => bdi_bandwidth/4),
> the less gap to the dirty limit ((C-A) => (C-B)), the less stable the
> pause time will be (given the same fluctuation of bdi_dirty).
>
> For example, if A drifts to A', its pause time may drift from 5ms to
> 6ms, while B to B' may drift from 50ms to 90ms.  It's much larger
> fluctuations in relative ratio as well as absolute time.
>
> Fig.1 before patch, gap (C-B) is too low to get smooth pause time
>
> throttle_bandwidth_A = bdi_bandwidth .........o
>                                              | o <= A'
>                                              |   o
>                                              |     o
>                                              |       o
>                                              |         o
> throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
>                                              |           | o <= B'
> ----------------------------------------------+-----------+---o
>                                              A           B   C
>
> The solution is to lower the slope of the throttle line accordingly,
> which makes B stabilize at some point more far away from C.
>
> Fig.2 after patch
>
> throttle_bandwidth_A = bdi_bandwidth .........o
>                                              | o <= A'
>                                              |   o
>                                              |     o
>    lowered max throttle bandwidth for B ===> *       o
>                                              |   *     o
> throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
>                                              |       |   * o
> ----------------------------------------------+-------+-------o
>                                              A       B       C
>
> Note that C is actually different points for 1-dirty and 4-dirtiers
> cases, but for easy graphing, we move them together.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2010-12-13 21:46:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c      2010-12-13 21:46:15.000000000 +0800
> @@ -587,6 +587,7 @@ static void balance_dirty_pages(struct a
>        unsigned long background_thresh;
>        unsigned long dirty_thresh;
>        unsigned long bdi_thresh;
> +       unsigned long task_thresh;
>        unsigned long long bw;
>        unsigned long period;
>        unsigned long pause = 0;
> @@ -616,7 +617,7 @@ static void balance_dirty_pages(struct a
>                        break;
>
>                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
> -               bdi_thresh = task_dirty_limit(current, bdi_thresh);
> +               task_thresh = task_dirty_limit(current, bdi_thresh);
>
>                /*
>                 * In order to avoid the stacked BDI deadlock we need
> @@ -638,14 +639,23 @@ static void balance_dirty_pages(struct a
>
>                bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
>
> -               if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
> +               if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
>                        pause = MAX_PAUSE;
>                        goto pause;
>                }
>
> +               /*
> +                * When bdi_dirty grows closer to bdi_thresh, it indicates more
> +                * concurrent dirtiers. Proportionally lower the max throttle
> +                * bandwidth. This will resist bdi_dirty from approaching to
> +                * close to task_thresh, and help reduce fluctuations of pause
> +                * time when there are lots of dirtiers.
> +                */
>                bw = bdi->write_bandwidth;
> -
>                bw = bw * (bdi_thresh - bdi_dirty);
> +               do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
> +
> +               bw = bw * (task_thresh - bdi_dirty);
>                do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);

Maybe changing this line to "do_div(bw, task_thresh /
TASK_SOFT_DIRTY_LIMIT + 1);"
is more consistent.

Thanks
Yan, Zheng

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 00/35] IO-less dirty throttling v4
       [not found] ` <AANLkTinFeu7LMaDFgUcP3r2oqVHE5bei3T5JTPGBNvS9@mail.gmail.com>
@ 2010-12-14  4:59     ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14  4:59 UTC (permalink / raw)
  To: Yan, Zheng, Andrew Morton
  Cc: Jan Kara, Wu, Fengguang, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 11:26:29AM +0800, Yan, Zheng wrote:
> got error "global_dirty_limits" [fs/nfs/nfs.ko] undefined! when
> compiling dirty-throttling-v4

Thanks! This should fix it: the new NFS code calls global_dirty_limits()
and NFS can be built as a module (nfs.ko), so the symbol needs to be
exported.  The fix will show up in the git tree after a while.

Thanks,
Fengguang
---
Subject: writeback: export global_dirty_limits() for NFS
Date: Tue Dec 14 12:55:18 CST 2010

"global_dirty_limits" [fs/nfs/nfs.ko] undefined!

Reported-by: Yan Zheng <zheng.z.yan@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-next.orig/mm/page-writeback.c	2010-12-14 12:54:57.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-14 12:55:11.000000000 +0800
@@ -419,6 +419,7 @@ void global_dirty_limits(unsigned long *
 	*pbackground = background;
 	*pdirty = dirty;
 }
+EXPORT_SYMBOL_GPL(global_dirty_limits);
 
 /**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-13 18:23   ` Valdis.Kletnieks
@ 2010-12-14  6:51       ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14  6:51 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 02:23:31AM +0800, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 13 Dec 2010 22:47:02 +0800, Wu Fengguang said:
> > Target for >60ms pause time when there are 100+ heavy dirtiers per bdi.
> > (will average around 100ms given 200ms max pause time)
> 
> > --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> > @@ -659,6 +659,27 @@ static unsigned long max_pause(unsigned 
> >  }
> >  
> >  /*
> > + * Scale up pause time for concurrent dirtiers in order to reduce CPU overheads.
> > + * But ensure reasonably large [min_pause, max_pause] range size, so that
> > + * nr_dirtied_pause (and hence future pause time) can stay reasonably stable.
> > + */
> > +static unsigned long min_pause(struct backing_dev_info *bdi,
> > +			       unsigned long max)
> > +{
> > +	unsigned long hi = ilog2(bdi->write_bandwidth);
> > +	unsigned long lo = ilog2(bdi->throttle_bandwidth);
> > +	unsigned long t;
> > +
> > +	if (lo >= hi)
> > +		return 1;
> > +
> > +	/* (N * 10ms) on 2^N concurrent tasks */
> > +	t = (hi - lo) * (10 * HZ) / 1024;
> 
> Either I need more caffeine, or the comment doesn't match the code
> if HZ != 1000?

The "ms" in the comment may be confusing, but the pause time (t) is
measured in jiffies :)  Hope the below patch helps.
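
Spelling out the arithmetic: 10 * HZ / 1024 is roughly HZ / 100 jiffies,
i.e. about 10ms of wall clock time regardless of HZ, so
t = (hi - lo) * (10 * HZ) / 1024 still works out to roughly N * 10ms on
2^N concurrent tasks whether HZ is 100, 250 or 1000.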

Thanks,
Fengguang
---
Subject: writeback: pause time is measured in jiffies
Date: Tue Dec 14 14:46:23 CST 2010

Add comments to make it clear.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-14 14:45:15.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-14 14:46:20.000000000 +0800
@@ -649,7 +649,7 @@ unlock:
  */
 static unsigned long max_pause(unsigned long bdi_thresh)
 {
-	unsigned long t;
+	unsigned long t;  /* jiffies */
 
 	/* 1ms for every 4MB */
 	t = bdi_thresh >> (32 - PAGE_CACHE_SHIFT -
@@ -669,7 +669,7 @@ static unsigned long min_pause(struct ba
 {
 	unsigned long hi = ilog2(bdi->write_bandwidth);
 	unsigned long lo = ilog2(bdi->throttle_bandwidth);
-	unsigned long t;
+	unsigned long t;  /* jiffies */
 
 	if (lo >= hi)
 		return 1;

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers
  2010-12-14  1:21     ` Yan, Zheng
  (?)
@ 2010-12-14  7:00       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14  7:00 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 09:21:19AM +0800, Yan Zheng wrote:
> On Mon, Dec 13, 2010 at 10:46 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > This will noticeably reduce the fluctuaions of pause time when there are
> > 100+ concurrent dirtiers.
> >
> > The more parallel dirtiers (1 dirtier => 4 dirtiers), the smaller
> > bandwidth each dirtier will share (bdi_bandwidth => bdi_bandwidth/4),
> > the less gap to the dirty limit ((C-A) => (C-B)), the less stable the
> > pause time will be (given the same fluctuation of bdi_dirty).
> >
> > For example, if A drifts to A', its pause time may drift from 5ms to
> > 6ms, while B to B' may drift from 50ms to 90ms.  It's much larger
> > fluctuations in relative ratio as well as absolute time.
> >
> > Fig.1 before patch, gap (C-B) is too low to get smooth pause time
> >
> > throttle_bandwidth_A = bdi_bandwidth .........o
> >                                              | o <= A'
> >                                              |   o
> >                                              |     o
> >                                              |       o
> >                                              |         o
> > throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
> >                                              |           | o <= B'
> > ----------------------------------------------+-----------+---o
> >                                              A           B   C
> >
> > The solution is to lower the slope of the throttle line accordingly,
> > which makes B stabilize at some point more far away from C.
> >
> > Fig.2 after patch
> >
> > throttle_bandwidth_A = bdi_bandwidth .........o
> >                                              | o <= A'
> >                                              |   o
> >                                              |     o
> >    lowered max throttle bandwidth for B ===> *       o
> >                                              |   *     o
> > throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
> >                                              |       |   * o
> > ----------------------------------------------+-------+-------o
> >                                              A       B       C
> >
> > Note that C is actually different points for 1-dirty and 4-dirtiers
> > cases, but for easy graphing, we move them together.
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   16 +++++++++++++---
> >  1 file changed, 13 insertions(+), 3 deletions(-)
> >
> > --- linux-next.orig/mm/page-writeback.c 2010-12-13 21:46:14.000000000 +0800
> > +++ linux-next/mm/page-writeback.c      2010-12-13 21:46:15.000000000 +0800
> > @@ -587,6 +587,7 @@ static void balance_dirty_pages(struct a
> >        unsigned long background_thresh;
> >        unsigned long dirty_thresh;
> >        unsigned long bdi_thresh;
> > +       unsigned long task_thresh;
> >        unsigned long long bw;
> >        unsigned long period;
> >        unsigned long pause = 0;
> > @@ -616,7 +617,7 @@ static void balance_dirty_pages(struct a
> >                        break;
> >
> >                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
> > -               bdi_thresh = task_dirty_limit(current, bdi_thresh);
> > +               task_thresh = task_dirty_limit(current, bdi_thresh);
> >
> >                /*
> >                 * In order to avoid the stacked BDI deadlock we need
> > @@ -638,14 +639,23 @@ static void balance_dirty_pages(struct a
> >
> >                bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
> >
> > -               if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
> > +               if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
> >                        pause = MAX_PAUSE;
> >                        goto pause;
> >                }
> >
> > +               /*
> > +                * When bdi_dirty grows closer to bdi_thresh, it indicates more
> > +                * concurrent dirtiers. Proportionally lower the max throttle
> > +                * bandwidth. This will resist bdi_dirty from approaching to
> > +                * close to task_thresh, and help reduce fluctuations of pause
> > +                * time when there are lots of dirtiers.
> > +                */
> >                bw = bdi->write_bandwidth;
> > -
> >                bw = bw * (bdi_thresh - bdi_dirty);
> > +               do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
> > +
> > +               bw = bw * (task_thresh - bdi_dirty);
> >                do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
> 
> Maybe changing this line to "do_div(bw, task_thresh /
> TASK_SOFT_DIRTY_LIMIT + 1);"
> is more consistent.

I'll show you another consistency of "shape" :)

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/light-dirtier-control-line.svg
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/heavy-dirtier-control-line.svg

In the above two figures, the overall control lines for light/heavy
dirtier tasks have exactly the same shape -- it's merely shifted in
the X axis direction. So the current form is actually more simple.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 30/35] nfs: heuristics to avoid commit
  2010-12-13 20:53     ` Trond Myklebust
@ 2010-12-14  8:20       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14  8:20 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML, Tang, Feng

On Tue, Dec 14, 2010 at 04:53:46AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-should-commit.patch)
> > The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
> > unnecessary COMMIT calls") do not work well for large inodes being
> > actively written to.
> > 
> > Refine the criterion to
> > - it has gone quiet (all data transferred to server)
> > - has accumulated >= 4MB data to commit (so it will be large IO)
> > - too few active commits (hence active IO) in the server
> 
> Where does the number 4MB come from? If I'm writing a 4GB file, I
> certainly do not want to commit every 4MB; that would make for a total
> of 1000 commit requests in addition to the writes. On a 64-bit client
> +server both having loads of memory and connected by a decently fast
> network, that can be a significant slowdown...

Sorry, the description omits too many details.

Let me show you the behavior in a real workload first.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/writeback-inode.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

On a 3GB client writing 50MB/s to the NFS server, the write chunk size
and commit size are mostly 32MB and 64MB, respectively.

The ->writepages() size and the later commit size actually scale up
to the available write bandwidth ("[PATCH 20/35] writeback: scale IO
chunk size up to device bandwidth").

So the "4MB" here is merely the minimal threshold. I chose it mainly
by the rule of thumb "it's not too bad IO size". And it's mainly used
for the cases:

1) low client=>server write bandwidth

In this case the VFS will call ->writepages() with a small (but always
 >= 4MB, see patch 20/35) nr_to_write, and the 4MB threshold helps
accumulate to-be-committed pages over multiple ->write_inode() calls.
As you said, it would help to further scale this 4MB threshold up to
the client's memory size. But complexity arises in the next case.

2) bandwidth/memory is high, but there are lots of concurrent dd's

When doing 10 dd's with mem=3G, it still achieves 20-30MB write/commit
size:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/writeback-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/nfs-commit-300.png

However, when there are 100 dd's, you cannot wait for each inode to
accumulate much more than 4MB of pages to commit, because 4MB * 100
inodes is already approaching the client's dirty limit. So you'll see
around 4-5MB commit sizes in this graph.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png
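
(Rough numbers, assuming the default 20% dirty ratio rather than
anything measured in this test: mem=3G gives a dirty limit somewhere
around 600MB, so 100 inodes each holding back 4MB already pins ~400MB
of it -- most of the budget.)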

Then you see the problem: how do we pick one auto-scaled threshold for
starting the commit on the current inode? It's easy in the 1-dd case.
However, when there are N dd's (admittedly NFS clients rarely run a
large N), we don't readily know N, and hence cannot scale down the
threshold that would suit the 1-dd case.

So I gave up on the scale-to-memory commit threshold idea that could
help case (1) and just do it in a dumb but hopefully good-enough way.
But I'm open to better ideas :)
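
For concreteness, here is a rough sketch of the shape of that check.
The helper names (nfs_inode_quiet(), nfs_commits_in_flight()) and the
use of ->ncommit are made up for illustration only -- this is not the
actual patch:

/* ~4MB worth of pages: the minimum worth committing */
#define MIN_COMMIT_PAGES	(4 << (20 - PAGE_SHIFT))

static bool nfs_should_commit(struct inode *inode)
{
	/* all dirty data transferred to the server? */
	if (!nfs_inode_quiet(inode))
		return false;

	/* enough unstable pages accumulated for a large commit? */
	if (NFS_I(inode)->ncommit < MIN_COMMIT_PAGES)
		return false;

	/* don't pile more commits onto an already busy server */
	return nfs_commits_in_flight(NFS_SERVER(inode)) == 0;
}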

> Most of the time, we really want the server to be managing its dirty
> cache entirely independently of the client. The latter should only be
> sending the commit when it really needs to free up those pages.

Agreed. And that is the major trade-off I'm fighting with: use a large
commit size, but not so large that it creates unacceptable fluctuations
in the data flow. It led to the decision to include patch 20/35 in this
series. It magically reduces the frequency of ->writepages()/write_inode()
calls and makes the number of pages written per ->writepages() (and
hence the later commit size) semi-adaptive to the number of concurrent
dd's.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers
  2010-12-14  7:00       ` Wu Fengguang
@ 2010-12-14 13:00         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 13:00 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 03:00:05PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 09:21:19AM +0800, Yan Zheng wrote:
> > On Mon, Dec 13, 2010 at 10:46 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > This will noticeably reduce the fluctuations of pause time when there are
> > > 100+ concurrent dirtiers.
> > >
> > > The more parallel dirtiers (1 dirtier => 4 dirtiers), the smaller
> > > bandwidth each dirtier will share (bdi_bandwidth => bdi_bandwidth/4),
> > > the less gap to the dirty limit ((C-A) => (C-B)), the less stable the
> > > pause time will be (given the same fluctuation of bdi_dirty).
> > >
> > > For example, if A drifts to A', its pause time may drift from 5ms to
> > > 6ms, while B to B' may drift from 50ms to 90ms.  It's much larger
> > > fluctuations in relative ratio as well as absolute time.
> > >
> > > Fig.1 before patch, gap (C-B) is too low to get smooth pause time
> > >
> > > throttle_bandwidth_A = bdi_bandwidth .........o
> > >                                              | o <= A'
> > >                                              |   o
> > >                                              |     o
> > >                                              |       o
> > >                                              |         o
> > > throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
> > >                                              |           | o <= B'
> > > ----------------------------------------------+-----------+---o
> > >                                              A           B   C
> > >
> > > The solution is to lower the slope of the throttle line accordingly,
> > > which makes B stabilize at a point farther away from C.
> > >
> > > Fig.2 after patch
> > >
> > > throttle_bandwidth_A = bdi_bandwidth .........o
> > >                                              | o <= A'
> > >                                              |   o
> > >                                              |     o
> > >    lowered max throttle bandwidth for B ===> *       o
> > >                                              |   *     o
> > > throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
> > >                                              |       |   * o
> > > ----------------------------------------------+-------+-------o
> > >                                              A       B       C
> > >
> > > Note that C is actually a different point in the 1-dirtier and 4-dirtiers
> > > cases, but for easy graphing, we draw them together.
> > >
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |   16 +++++++++++++---
> > >  1 file changed, 13 insertions(+), 3 deletions(-)
> > >
> > > --- linux-next.orig/mm/page-writeback.c 2010-12-13 21:46:14.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c      2010-12-13 21:46:15.000000000 +0800
> > > @@ -587,6 +587,7 @@ static void balance_dirty_pages(struct a
> > >        unsigned long background_thresh;
> > >        unsigned long dirty_thresh;
> > >        unsigned long bdi_thresh;
> > > +       unsigned long task_thresh;
> > >        unsigned long long bw;
> > >        unsigned long period;
> > >        unsigned long pause = 0;
> > > @@ -616,7 +617,7 @@ static void balance_dirty_pages(struct a
> > >                        break;
> > >
> > >                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
> > > -               bdi_thresh = task_dirty_limit(current, bdi_thresh);
> > > +               task_thresh = task_dirty_limit(current, bdi_thresh);
> > >
> > >                /*
> > >                 * In order to avoid the stacked BDI deadlock we need
> > > @@ -638,14 +639,23 @@ static void balance_dirty_pages(struct a
> > >
> > >                bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
> > >
> > > -               if (bdi_dirty >= bdi_thresh || nr_dirty > dirty_thresh) {
> > > +               if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
> > >                        pause = MAX_PAUSE;
> > >                        goto pause;
> > >                }
> > >
> > > +               /*
> > > +                * When bdi_dirty grows closer to bdi_thresh, it indicates more
> > > +                * concurrent dirtiers. Proportionally lower the max throttle
> > > +                * bandwidth. This will resist bdi_dirty from approaching to
> > > +                * close to task_thresh, and help reduce fluctuations of pause
> > > +                * time when there are lots of dirtiers.
> > > +                */
> > >                bw = bdi->write_bandwidth;
> > > -
> > >                bw = bw * (bdi_thresh - bdi_dirty);
> > > +               do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
> > > +
> > > +               bw = bw * (task_thresh - bdi_dirty);
> > >                do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
> > 
> > Maybe changing this line to "do_div(bw, task_thresh /
> > TASK_SOFT_DIRTY_LIMIT + 1);"
> > is more consistent.
> 
> I'll show you another consistency of "shape" :)
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/light-dirtier-control-line.svg
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/heavy-dirtier-control-line.svg
> 
> In the above two figures, the overall control lines for light/heavy
> dirtier tasks have exactly the same shape -- they are merely shifted
> along the X axis. So the current form is actually simpler.

Sorry, it's not the overall control line that is simply shifted, but
the task control line.

bdi control line:
> > >                bw = bw * (bdi_thresh - bdi_dirty);
> > > +               do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);

task control line:
> > > +               bw = bw * (task_thresh - bdi_dirty);
> > >                do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);

The use of bdi_thresh in the last line makes sure all task control
lines are of the same slope.
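
Spelled out: ignoring the "+ 1" rounding guard, the task factor is

	(task_thresh - bdi_dirty) * TASK_SOFT_DIRTY_LIMIT / bdi_thresh

and its slope with respect to bdi_dirty is
-TASK_SOFT_DIRTY_LIMIT / bdi_thresh, independent of task_thresh.
Changing task_thresh (light vs heavy dirtier) only moves the point
where the line hits zero, which is exactly the horizontal shift in the
slides. Dividing by task_thresh instead, as suggested, would give each
task its own slope -- steeper for the heavier dirtiers, whose
task_thresh is lower.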

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-13 14:46   ` Wu Fengguang
@ 2010-12-14 13:37     ` Richard Kennedy
  -1 siblings, 0 replies; 202+ messages in thread
From: Richard Kennedy @ 2010-12-14 13:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 2714 bytes --]

On Mon, 2010-12-13 at 22:46 +0800, Wu Fengguang wrote:
> plain text document attachment
> (writeback-speedup-per-bdi-threshold-ramp-up.patch)
> Reduce the dampening for the control system, yielding faster
> convergence.
> 
> Currently it converges at a snail's pace for slow devices (in order of
> minutes).  For really fast storage, the convergence speed should be fine.
> 
> It makes sense to make it reasonably fast for typical desktops.
> 
> After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> 16GB mem, which seems reasonable.
> 
> $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> BdiDirtyThresh:            0 kB
> BdiDirtyThresh:       118748 kB
> BdiDirtyThresh:       214280 kB
> BdiDirtyThresh:       303868 kB
> BdiDirtyThresh:       376528 kB
> BdiDirtyThresh:       411180 kB
> BdiDirtyThresh:       448636 kB
> BdiDirtyThresh:       472260 kB
> BdiDirtyThresh:       490924 kB
> BdiDirtyThresh:       499596 kB
> BdiDirtyThresh:       507068 kB
> ...
> DirtyThresh:          530392 kB
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Richard Kennedy <richard@rsk.demon.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> @@ -145,7 +145,7 @@ static int calc_period_shift(void)
>  	else
>  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
>  				100;
> -	return 2 + ilog2(dirty_total - 1);
> +	return ilog2(dirty_total - 1) - 1;
>  }
>  
>  /*
> 
> 
Hi Fengguang,

I've been running my test set on your v3 series and generally it's
giving good results in line with the mainline kernel, with much less
variability and a lower standard deviation of the results, so it is
much more repeatable.

However, it doesn't seem to be honouring the background_dirty_threshold.

The attached graph is from a simple fio write test of 400MB on ext4.
All dirty pages are completely written in 15 seconds, but I expect to
see up to background_dirty_threshold pages staying dirty until the
30-second background task writes them out. So it is much too eager to
write back dirty pages.

As to the ramp-up time, when writing to 2 disks at the same time I see
the per_bdi_threshold taking up to 20 seconds to converge on a steady
value after one of the writes stops. So I think this could be sped up
even more, at least on my setup.

I am just about to start testing v4 & will report anything interesting.

regards
Richard

[-- Attachment #2: dirty.png --]
[-- Type: image/png, Size: 3516 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 13:37     ` Richard Kennedy
@ 2010-12-14 13:59       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 13:59 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

Hi Richard,

On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> On Mon, 2010-12-13 at 22:46 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (writeback-speedup-per-bdi-threshold-ramp-up.patch)
> > Reduce the dampening for the control system, yielding faster
> > convergence.
> > 
> > Currently it converges at a snail's pace for slow devices (in order of
> > minutes).  For really fast storage, the convergence speed should be fine.
> > 
> > It makes sense to make it reasonably fast for typical desktops.
> > 
> > After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> > So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> > 16GB mem, which seems reasonable.
> > 
> > $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> > BdiDirtyThresh:            0 kB
> > BdiDirtyThresh:       118748 kB
> > BdiDirtyThresh:       214280 kB
> > BdiDirtyThresh:       303868 kB
> > BdiDirtyThresh:       376528 kB
> > BdiDirtyThresh:       411180 kB
> > BdiDirtyThresh:       448636 kB
> > BdiDirtyThresh:       472260 kB
> > BdiDirtyThresh:       490924 kB
> > BdiDirtyThresh:       499596 kB
> > BdiDirtyThresh:       507068 kB
> > ...
> > DirtyThresh:          530392 kB
> > 
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > CC: Richard Kennedy <richard@rsk.demon.co.uk>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> > @@ -145,7 +145,7 @@ static int calc_period_shift(void)
> >  	else
> >  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> >  				100;
> > -	return 2 + ilog2(dirty_total - 1);
> > +	return ilog2(dirty_total - 1) - 1;
> >  }
> >  
> >  /*
> > 
> > 
> Hi Fengguang,
> 
> I've been running my test set on your v3 series and generally it's
> giving good results in line with the mainline kernel, with much less
> variability and lower standard deviation of the results so it is much
> more repeatable.

Glad to hear that, and thank you very much for trying it out!

> However, it doesn't seem to be honouring the background_dirty_threshold.

> The attached graph is from a simple fio write test of 400Mb on ext4.
> All dirty pages are completely written in 15 seconds, but I expect to
> see up to background_dirty_threshold pages staying dirty until the 30
> second background task writes them out. So it is much too eager to write
> back dirty pages.
 
This is interesting, and seems easy to root cause. When testing v4,
would you help collect the following trace events?

echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
echo 1 > /debug/tracing/events/writeback/balance_dirty_state/enable
echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable

They'll have a good chance of exposing the bug.
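
(Rough capture recipe, in case it helps: after enabling the events,
something like "cat /debug/tracing/trace_pipe > trace.log" while the
fio test runs should be enough; I can then turn trace.log into the
usual graphs.)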

> As to the ramp up time, when writing to 2 disks at the same time I see
> the per_bdi_threshold taking up to 20 seconds to converge on a steady
> value after one of the write stops. So I think this could be speeded up
> even more, at least on my setup.

I see roughly the same ramp-up time on the 1-disk 3GB mem test:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
 
Given that this is the typical desktop case, it does seem reasonable
to speed it up further.
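
(For reference, the knob here is calc_period_shift(): going from
"2 + ilog2(dirty_total - 1)" to "ilog2(dirty_total - 1) - 1" lowers
the shift by 3, so the floating proportions age roughly 8 times
faster, which is where the minutes-to-~10s improvement comes from.
Dropping one more bit should roughly double the speed again,
presumably at the cost of more jitter in the per-bdi thresholds.)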

> I am just about to start testing v4 & will report anything interesting.

Thanks!

Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 13:59       ` Wu Fengguang
@ 2010-12-14 14:33         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 14:33 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:

> > As to the ramp up time, when writing to 2 disks at the same time I see
> > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > value after one of the write stops. So I think this could be speeded up
> > even more, at least on my setup.
> 
> I have the roughly same ramp up time on the 1-disk 3GB mem test:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
>  

Interestingly, the above graph shows that after about 10s of fast ramp
up, there is another 20s of slow ramp down. It's obviously due to the
decline of the global limit:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png

But why is the global limit declining?  The following log shows that
nr_file_pages keeps growing and only stabilizes after 75 seconds (such
a long time!). In the same period nr_free_pages slowly goes down to its
stable value. Given that the global limit is mainly derived from
nr_free_pages+nr_file_pages (I disabled swap), something must be
slowly eating memory until the 75 second mark. Maybe the tracing ring
buffers?

         free     file      reclaimable pages
50s      369324 + 318760 => 688084
60s      235989 + 448096 => 684085

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
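
(Rough arithmetic: the two samples above differ by ~4000 pages, i.e.
~16MB of dirtyable memory lost in just those 10 seconds; assuming the
default 20% dirty ratio, that alone shaves ~3MB off the dirty limit,
and the drift continues all the way to the 75 second mark.)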

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 14:33         ` Wu Fengguang
@ 2010-12-14 14:39           ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 14:39 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> 
> > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > value after one of the write stops. So I think this could be speeded up
> > > even more, at least on my setup.
> > 
> > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> >  
> 
> Interestingly, the above graph shows that after about 10s fast ramp
> up, there is another 20s slow ramp down. It's obviously due the
> decline of global limit:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png
> 
> But why is the global limit declining?  The following log shows that
> nr_file_pages keeps growing and goes stable after 75 seconds (so long
> time!). In the same period nr_free_pages goes slowly down to its
> stable value. Given that the global limit is mainly derived from
> nr_free_pages+nr_file_pages (I disabled swap), something must be
> slowly eating memory until the 75 second mark. Maybe the tracing ring buffers?
> 
>          free     file      reclaimable pages
> 50s      369324 + 318760 => 688084
> 60s      235989 + 448096 => 684085
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat

The log shows that ~64MB of reclaimable memory is stolen. But the trace
data only takes 1.8MB. Hmm..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 14:39           ` Wu Fengguang
  (?)
@ 2010-12-14 14:50             ` Peter Zijlstra
  -1 siblings, 0 replies; 202+ messages in thread
From: Peter Zijlstra @ 2010-12-14 14:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Richard Kennedy, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, 2010-12-14 at 22:39 +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> > > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > 
> > > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > > value after one of the write stops. So I think this could be speeded up
> > > > even more, at least on my setup.
> > > 
> > > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> > >  
> > 
> > Interestingly, the above graph shows that after about 10s fast ramp
> > up, there is another 20s slow ramp down. It's obviously due the
> > decline of global limit:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png
> > 
> > But why is the global limit declining?  The following log shows that
> > nr_file_pages keeps growing and goes stable after 75 seconds (so long
> > time!). In the same period nr_free_pages goes slowly down to its
> > stable value. Given that the global limit is mainly derived from
> > nr_free_pages+nr_file_pages (I disabled swap), something must be
> > slowly eating memory until 75 ms. Maybe the tracing ring buffers?
> > 
> >          free     file      reclaimable pages
> > 50s      369324 + 318760 => 688084
> > 60s      235989 + 448096 => 684085
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
> 
> The log shows that ~64MB reclaimable memory is stoled. But the trace
> data only takes 1.8MB. Hmm..

Also, trace buffers are fully pre-allocated.

Inodes perhaps?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 14:39           ` Wu Fengguang
@ 2010-12-14 14:56             ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 14:56 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 10:39:02PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> > > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > 
> > > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > > value after one of the write stops. So I think this could be speeded up
> > > > even more, at least on my setup.
> > > 
> > > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> > >  
> > 
> > Interestingly, the above graph shows that after about 10s fast ramp
> > up, there is another 20s slow ramp down. It's obviously due the
> > decline of global limit:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png
> > 
> > But why is the global limit declining?  The following log shows that
> > nr_file_pages keeps growing and goes stable after 75 seconds (so long
> > time!). In the same period nr_free_pages goes slowly down to its
> > stable value. Given that the global limit is mainly derived from
> > nr_free_pages+nr_file_pages (I disabled swap), something must be
> > slowly eating memory until 75 ms. Maybe the tracing ring buffers?
> > 
> >          free     file      reclaimable pages
> > 50s      369324 + 318760 => 688084
> > 60s      235989 + 448096 => 684085
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
> 
> The log shows that ~64MB reclaimable memory is stoled. But the trace
> data only takes 1.8MB. Hmm..

ext2 has the same pattern:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36/dirty-pages.png

But it does not happen for btrfs!

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat-dirty.png

It seems that it's nr_slab_reclaimable that keeps growing until 75s.

Looking at
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36/slabinfo-end

It should be the buffer heads that slowly eat the memory during that time:

buffer_head       670304 670662    104   37    1 : tunables  120   60 8 : slabdata  18117  18126    480

(670304/37)*4 = 72464KB.

The consumption seems acceptable for a 3G memory system.
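
A minimal sketch of the same arithmetic in plain C (the helper name is
made up; it assumes 4KB pages and one page per slab, as the slabinfo
line shows):

	/* Back-of-the-envelope slab footprint from the slabinfo fields above:
	 * 670304 active objects / 37 objects per slab ~= 18116 slabs,
	 * 1 page per slab * 4KB per page ~= 72464KB of buffer_head slab.
	 */
	static unsigned long slab_footprint_kb(unsigned long active_objs,
					       unsigned long objs_per_slab,
					       unsigned long pages_per_slab)
	{
		return (active_objs / objs_per_slab) * pages_per_slab * 4;
	}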

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 14:50             ` Peter Zijlstra
  (?)
  (?)
@ 2010-12-14 15:15             ` Wu Fengguang
  2010-12-14 15:26               ` Wu Fengguang
  -1 siblings, 1 reply; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Kennedy, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 2416 bytes --]

On Tue, Dec 14, 2010 at 10:50:55PM +0800, Peter Zijlstra wrote:
> On Tue, 2010-12-14 at 22:39 +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> > > On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> > > > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > > 
> > > > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > > > value after one of the write stops. So I think this could be speeded up
> > > > > even more, at least on my setup.
> > > > 
> > > > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> > > >  
> > > 
> > > Interestingly, the above graph shows that after about 10s fast ramp
> > > up, there is another 20s slow ramp down. It's obviously due the
> > > decline of global limit:
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png
> > > 
> > > But why is the global limit declining?  The following log shows that
> > > nr_file_pages keeps growing and goes stable after 75 seconds (so long
> > > time!). In the same period nr_free_pages goes slowly down to its
> > > stable value. Given that the global limit is mainly derived from
> > > nr_free_pages+nr_file_pages (I disabled swap), something must be
> > > slowly eating memory until 75 ms. Maybe the tracing ring buffers?
> > > 
> > >          free     file      reclaimable pages
> > > 50s      369324 + 318760 => 688084
> > > 60s      235989 + 448096 => 684085
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
> > 
> > The log shows that ~64MB reclaimable memory is stoled. But the trace
> > data only takes 1.8MB. Hmm..
> 
> Also, trace buffers are fully pre-allocated.
> 
> Inodes perhaps?

Just figured out that it's the buffer heads :)

The other interesting question is why it takes up to 50s to consume
all the nr_free_pages pages. I would imagine the free pages would be
quickly allocated to the page cache..

Attached is the graph for ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36

Thanks,
Fengguang

[-- Attachment #2: vmstat-reclaimable-500.png --]
[-- Type: image/png, Size: 66540 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 15:15             ` Wu Fengguang
@ 2010-12-14 15:26               ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Kennedy, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 3067 bytes --]

On Tue, Dec 14, 2010 at 11:15:07PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 10:50:55PM +0800, Peter Zijlstra wrote:
> > On Tue, 2010-12-14 at 22:39 +0800, Wu Fengguang wrote:
> > > On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> > > > On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> > > > > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > > > 
> > > > > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > > > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > > > > value after one of the write stops. So I think this could be speeded up
> > > > > > even more, at least on my setup.
> > > > > 
> > > > > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > > > > 
> > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> > > > >  
> > > > 
> > > > Interestingly, the above graph shows that after about 10s fast ramp
> > > > up, there is another 20s slow ramp down. It's obviously due the
> > > > decline of global limit:
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png
> > > > 
> > > > But why is the global limit declining?  The following log shows that
> > > > nr_file_pages keeps growing and goes stable after 75 seconds (so long
> > > > time!). In the same period nr_free_pages goes slowly down to its
> > > > stable value. Given that the global limit is mainly derived from
> > > > nr_free_pages+nr_file_pages (I disabled swap), something must be
> > > > slowly eating memory until 75 ms. Maybe the tracing ring buffers?
> > > > 
> > > >          free     file      reclaimable pages
> > > > 50s      369324 + 318760 => 688084
> > > > 60s      235989 + 448096 => 684085
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
> > > 
> > > The log shows that ~64MB reclaimable memory is stoled. But the trace
> > > data only takes 1.8MB. Hmm..
> > 
> > Also, trace buffers are fully pre-allocated.
> > 
> > Inodes perhaps?
> 
> Just figured out that it's the buffer heads :)
> 
> The other interesting question is, why it takes up to 50s to consume
> all the nr_free_pages pages. I would imagine the free pages be quickly
> allocated to the page cache..
> 
> Attached is the graph for ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36

Ah, it's embarrassing.. we are writing data, so the free memory
consumption is simply bounded by the disk write speed..

So it's FS independent.

Here is the graph for ext3 on vanilla kernel, generated from 

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext3-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-19-57/vmstat

And btrfs on vanilla kernel

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat

Thanks,
Fengguang

[-- Attachment #2: vmstat-reclaimable-500.png --]
[-- Type: image/png, Size: 68089 bytes --]

[-- Attachment #3: vmstat-dirty-500.png --]
[-- Type: image/png, Size: 57116 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 29/35] nfs: in-commit pages accounting and wait queue
  2010-12-13 21:15     ` Trond Myklebust
@ 2010-12-14 15:40       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 15:40 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 05:15:51AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-in-commit.patch)
> > When doing 10+ concurrent dd's, I observed very bumpy commits submission
> > (partly because the dd's are started at the same time, and hence reached
> > 4MB to-commit pages at the same time). Basically we rely on the server
> > to complete and return write/commit requests, and want both to progress
> > smoothly and not consume too many pages. The write request wait queue is
> > not enough as it's mainly network bounded. So add another commit request
> > wait queue. Only async writes need to sleep on this queue.
> > 
> 
> I'm not understanding the above reasoning. Why should we serialise
> commits at the per-filesystem level (and only for non-blocking flushes
> at that)?

I added the commit wait queue after seeing this graph, where there is a
very bursty pattern of commit submission and hence completion:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png

leading to big fluctuations, eg. the almost straight up/straight down
lines below
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/vmstat-dirty-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages-200.png

A commit wait queue will help wipe out the "peaks". The "fixed" graph
is
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/dirty-pages.png

Blocking flushes don't need to wait on this queue because they already
throttle themselves by waiting on the inode commit lock before/after
the commit.  They actually should not wait on this queue, to prevent
sync requests being unnecessarily blocked by async ones.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 31/35] nfs: dont change wbc->nr_to_write in write_inode()
  2010-12-13 21:01     ` Trond Myklebust
@ 2010-12-14 15:53       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-14 15:53 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 05:01:44AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (writeback-nfs-commit-remove-nr_to_write.patch)
> > It's introduced in commit 420e3646 ("NFS: Reduce the number of
> > unnecessary COMMIT calls") and seems not necessary.
> > 
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/nfs/write.c |    9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> > 
> > --- linux-next.orig/fs/nfs/write.c	2010-12-13 21:46:21.000000000 +0800
> > +++ linux-next/fs/nfs/write.c	2010-12-13 21:46:22.000000000 +0800
> > @@ -1557,15 +1557,8 @@ static int nfs_commit_unstable_pages(str
> >  	}
> >  
> >  	ret = nfs_commit_inode(inode, flags);
> > -	if (ret >= 0) {
> > -		if (wbc->sync_mode == WB_SYNC_NONE) {
> > -			if (ret < wbc->nr_to_write)
> > -				wbc->nr_to_write -= ret;
> > -			else
> > -				wbc->nr_to_write = 0;
> > -		}
> > +	if (ret >= 0)
> >  		return 0;
> > -	}
> >  out_mark_dirty:
> >  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> >  	return ret;
> 
> It is there in order to tell the VM that it has succeeded in freeing up
> a certain number of pages. Otherwise, we end up cycling forever in
> writeback_sb_inodes() & friends with the latter not realising that they
> have made progress.

Yeah it seems reasonable, thanks for the explanation.  I'll drop it.

The decrease of nr_to_write seems a partial solution. It will return
control to wb_writeback(); however, the function may still busy loop
for a long time without doing anything when all the unstable pages are
in-commit pages.

Strictly speaking, over_bground_thresh() should only check the number
of to-commit pages, because the flusher can only commit the to-commit
pages, and can do nothing but wait for the server to respond to
in-commit pages. A clean solution would involve breaking up the
current NR_UNSTABLE_NFS into two counters. But you may not like the
side effect that more dirty pages will then be cached in the NFS
client, as the background flusher will quit earlier :)

As a simple fix, I have a patch to avoid such a possible busy loop.

Thanks,
Fengguang
---

Subject: writeback: sleep for 10ms when nothing is written
Date: Fri Dec 03 18:31:59 CST 2010

It seems safer to take a sleep when nothing was done.

NFS background writeback could possibly busy loop in wb_writeback()
when the NFS client has sent and committed all data. It relies on the
NFS server and network conditions for the commit feedback that knocks
down the NR_UNSTABLE_NFS number.

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-12-03 18:29:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-12-03 18:31:56.000000000 +0800
@@ -741,6 +741,11 @@ static long wb_writeback(struct bdi_writ
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
+		if (list_empty(&wb->b_more_io)) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			io_schedule_timeout(max(HZ/100, 1));
+			continue;
+		}
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 29/35] nfs: in-commit pages accounting and wait queue
  2010-12-14 15:40       ` Wu Fengguang
@ 2010-12-14 15:57         ` Trond Myklebust
  -1 siblings, 0 replies; 202+ messages in thread
From: Trond Myklebust @ 2010-12-14 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Tue, 2010-12-14 at 23:40 +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 05:15:51AM +0800, Trond Myklebust wrote:
> > On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > > plain text document attachment (writeback-nfs-in-commit.patch)
> > > When doing 10+ concurrent dd's, I observed very bumpy commits submission
> > > (partly because the dd's are started at the same time, and hence reached
> > > 4MB to-commit pages at the same time). Basically we rely on the server
> > > to complete and return write/commit requests, and want both to progress
> > > smoothly and not consume too many pages. The write request wait queue is
> > > not enough as it's mainly network bounded. So add another commit request
> > > wait queue. Only async writes need to sleep on this queue.
> > > 
> > 
> > I'm not understanding the above reasoning. Why should we serialise
> > commits at the per-filesystem level (and only for non-blocking flushes
> > at that)?
> 
> I did the commit wait queue after seeing this graph, where there is
> very bursty pattern of commit submission and hence completion:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png
> 
> leading to big fluctuations, eg. the almost straight up/straight down
> lines below
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/vmstat-dirty-300.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages-200.png
> 
> A commit wait queue will help wipe out the "peaks". The "fixed" graph
> is
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty-300.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/dirty-pages.png
> 
> Blocking flushes don't need to wait on this queue because they already
> throttle themselves by waiting on the inode commit lock before/after
> the commit.  They actually should not wait on this queue, to prevent
> sync requests being unnecessarily blocked by async ones.

OK, but isn't it better, then, to just abort the commit and have the
relevant async process retry it later?

This is a code path which is followed by kswapd, for instance. It seems
dangerous to be throttling that instead of allowing it to proceed (and
perhaps being able to free up memory on some other partition in the mean
time).

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-14  6:51       ` Wu Fengguang
  (?)
@ 2010-12-14 18:42       ` Valdis.Kletnieks
  2010-12-14 18:55           ` Peter Zijlstra
  -1 siblings, 1 reply; 202+ messages in thread
From: Valdis.Kletnieks @ 2010-12-14 18:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

On Tue, 14 Dec 2010 14:51:33 +0800, Wu Fengguang said:

> > > +	/* (N * 10ms) on 2^N concurrent tasks */
> > > +	t = (hi - lo) * (10 * HZ) / 1024;
> > 
> > Either I need more caffeine, or the comment doesn't match the code
> > if HZ != 1000?
> 
> The "ms" in the comment may be confusing, but the pause time (t) is
> measured in jiffies :)  Hope the below patch helps.

No, I meant that 10 * HZ evaluates to different numbers depending on what
the CONFIG_HZ parameter is set to - 100, 250, 1000, or some other
custom value.  Does this code behave correctly on a CONFIG_HZ=100 kernel?


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-14 18:42       ` Valdis.Kletnieks
  2010-12-14 18:55           ` Peter Zijlstra
@ 2010-12-14 18:55           ` Peter Zijlstra
  0 siblings, 0 replies; 202+ messages in thread
From: Peter Zijlstra @ 2010-12-14 18:55 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Dave Chinner,
	Christoph Hellwig, Trond Myklebust, Theodore Ts'o,
	Chris Mason, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, 2010-12-14 at 13:42 -0500, Valdis.Kletnieks@vt.edu wrote:
> On Tue, 14 Dec 2010 14:51:33 +0800, Wu Fengguang said:
> 
> > > > +	/* (N * 10ms) on 2^N concurrent tasks */
> > > > +	t = (hi - lo) * (10 * HZ) / 1024;
> > > 
> > > Either I need more caffeine, or the comment doesn't match the code
> > > if HZ != 1000?
> > 
> > The "ms" in the comment may be confusing, but the pause time (t) is
> > measured in jiffies :)  Hope the below patch helps.
> 
> No, I meant that 10 * HZ evaluates to different numbers depending what
> the CONFIG_HZ parameter is set to - 100, 250, 1000, or some other
> custom value.  Does this code behave correctly on a CONFIG_HZ=100 kernel?

10*HZ = 10 seconds
(10*HZ) / 1024 ~= 10 milliseconds

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-14 18:55           ` Peter Zijlstra
  (?)
  (?)
@ 2010-12-14 20:13           ` Valdis.Kletnieks
  2010-12-14 20:24               ` Peter Zijlstra
  -1 siblings, 1 reply; 202+ messages in thread
From: Valdis.Kletnieks @ 2010-12-14 20:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Dave Chinner,
	Christoph Hellwig, Trond Myklebust, Theodore Ts'o,
	Chris Mason, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 841 bytes --]

On Tue, 14 Dec 2010 19:55:08 +0100, Peter Zijlstra said:

> 10*HZ = 10 seconds
> (10*HZ) / 1024 ~= 10 milliseconds

from include/asm-generic/param.h (which is included by x86)

#ifdef __KERNEL__
# define HZ             CONFIG_HZ       /* Internal kernel timer frequency */
# define USER_HZ        100             /* some user interfaces are */
# define CLOCKS_PER_SEC (USER_HZ)       /* in "ticks" like times() */
#endif

Note that HZ isn't USER_HZ or CLOCKS_PER_SEC  - it's CONFIG_HZ, which last
I checked is still user-settable.  If not, then there needs to be a massive cleanup
of Kconfig and defconfig:

% grep HZ .config
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

So you're not guaranteed that 10*HZ is 10 seconds.  10*USER_HZ, sure. But not HZ.





[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-14 20:13           ` Valdis.Kletnieks
  2010-12-14 20:24               ` Peter Zijlstra
@ 2010-12-14 20:24               ` Peter Zijlstra
  0 siblings, 0 replies; 202+ messages in thread
From: Peter Zijlstra @ 2010-12-14 20:24 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Dave Chinner,
	Christoph Hellwig, Trond Myklebust, Theodore Ts'o,
	Chris Mason, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, 2010-12-14 at 15:13 -0500, Valdis.Kletnieks@vt.edu wrote:
> So you're not guaranteed that 10*HZ is 10 seconds.  10*USER_HZ, sure.
> But not HZ.

You're confused. 10*HZ jiffies is always 10 seconds. Hertz means
per-second. We take CONFIG_HZ ticks per second, so waiting HZ jiffies
makes us wait 1 second.

USER_HZ is archaic and only used to stabilize user-interfaces that for
some daft reason depended on HZ.
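
A minimal, self-contained illustration (the function name is made up;
only the formula comes from the quoted hunk): since t is measured in
jiffies and HZ jiffies equal one second, the result works out to
roughly N * 10ms of wall time for 2^N concurrent tasks whatever
CONFIG_HZ is, up to rounding to whole jiffies:

	#define HZ 250	/* stand-in for CONFIG_HZ; 100, 250 or 1000 all give ~10ms per step */

	/* hi - lo plays the role of N in the "(N * 10ms) on 2^N concurrent tasks" comment */
	static unsigned long pause_jiffies(unsigned long hi, unsigned long lo)
	{
		return (hi - lo) * (10 * HZ) / 1024;	/* jiffies, i.e. ~N * 10ms of wall time */
	}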

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers
  2010-12-14 20:24               ` Peter Zijlstra
  (?)
  (?)
@ 2010-12-14 20:37               ` Valdis.Kletnieks
  -1 siblings, 0 replies; 202+ messages in thread
From: Valdis.Kletnieks @ 2010-12-14 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Dave Chinner,
	Christoph Hellwig, Trond Myklebust, Theodore Ts'o,
	Chris Mason, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 252 bytes --]

On Tue, 14 Dec 2010 21:24:15 +0100, Peter Zijlstra said:

> You're confused. 10*HZ jiffies is always 10 seconds.

I must be misremembering times past, when HZ was settable
but a jiffie was always 1/100th of a second...  Senility has
finally set in. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 29/35] nfs: in-commit pages accounting and wait queue
  2010-12-14 15:57         ` Trond Myklebust
@ 2010-12-15 15:07           ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-15 15:07 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Tue, Dec 14, 2010 at 11:57:25PM +0800, Trond Myklebust wrote:
> On Tue, 2010-12-14 at 23:40 +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 05:15:51AM +0800, Trond Myklebust wrote:
> > > On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > > > plain text document attachment (writeback-nfs-in-commit.patch)
> > > > When doing 10+ concurrent dd's, I observed very bumpy commits submission
> > > > (partly because the dd's are started at the same time, and hence reached
> > > > 4MB to-commit pages at the same time). Basically we rely on the server
> > > > to complete and return write/commit requests, and want both to progress
> > > > smoothly and not consume too many pages. The write request wait queue is
> > > > not enough as it's mainly network bounded. So add another commit request
> > > > wait queue. Only async writes need to sleep on this queue.
> > > > 
> > > 
> > > I'm not understanding the above reasoning. Why should we serialise
> > > commits at the per-filesystem level (and only for non-blocking flushes
> > > at that)?
> > 
> > I did the commit wait queue after seeing this graph, where there is
> > very bursty pattern of commit submission and hence completion:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png
> > 
> > leading to big fluctuations, eg. the almost straight up/straight down
> > lines below
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/vmstat-dirty-300.png
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages.png
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/dirty-pages-200.png
> > 
> > A commit wait queue will help wipe out the "peaks". The "fixed" graph
> > is
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty-300.png
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/dirty-pages.png
> > 
> > Blocking flushes don't need to wait on this queue because they already
> > throttle themselves by waiting on the inode commit lock before/after
> > the commit.  They actually should not wait on this queue, to prevent
> > sync requests being unnecessarily blocked by async ones.
> 
> OK, but isn't it better then to just abort the commit, and have the
> relevant async process retry it later?

I'll drop this patch. I vaguely remember that the bursty commit graph
mentioned below

> > I did the commit wait queue after seeing this graph, where there is
> > very bursty pattern of commit submission and hence completion:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png

is caused by this condition in nfs_should_commit():

        /* big enough */
        if (to_commit >= MIN_WRITEBACK_PAGES)
                return true;

It's because the 100 dd's accumulated 4MB of dirty pages at roughly the
same time. I then added the in_commit accounting (for the test below)
and the wait queue. It seems the condition below is good enough to
smooth out the commit distribution.

        /* active commits drop low: kick more IO for the server disk */
        if (to_commit > in_commit / 2)
                return true;

And I'm going to go further: remove the above two conditions and make a
much simpler change:

-               if (nfsi->ncommit <= (nfsi->npages >> 1))
+               if (nfsi->ncommit <= (nfsi->npages >> 4))
                        goto out_mark_dirty;

The change to ">> 4" helps reduce the fluctuation to an acceptable
level: balance_dirty_pages() is now doing soft dirty throttling in a
small range of bdi_dirty_limit/8. The above change guarantees that
when an NFS commit completes, bdi_dirty won't suddenly drop out of
the soft throttling region. On my mem=3GB test box and 1-dd case,
npages/16 ~= 32MB is still a large size.
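
To put rough numbers on that (everything except the ~32MB figure is an
illustrative assumption, not a measurement):

	soft throttling band ~= bdi_dirty_limit / 8
	                        (several tens of MB if bdi_dirty_limit is
	                         a few hundred MB on this box)
	per-commit drop      ~= npages / 16 ~= 32MB

	As long as npages/16 stays below bdi_dirty_limit/8, a completing
	COMMIT removes fewer unstable pages than the width of the band,
	so bdi_dirty stays inside the soft throttling region.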

Basic tests show that it achieves roughly the same effect as these
two patches:

[PATCH 29/35] nfs: in-commit pages accounting and wait queue
[PATCH 30/35] nfs: heuristics to avoid commit

It would not only be simpler, but would also be able to do larger
commits in the case of a fast, memory-abundant server/client connected
by a slow network. In that case, the above two patches will do 4MB
commits, while the simpler change can do much larger ones.

> This is a code path which is followed by kswapd, for instance. It seems
> dangerous to be throttling that instead of allowing it to proceed (and
> perhaps being able to free up memory on some other partition in the mean
> time).

It seems pageout() calls nfs_writepage(), which does an unstable write
and won't commit the page. This means pageout() cannot guarantee that
the page gets freed at all... so NFS dirty pages are virtually
unreclaimable...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-14 13:59       ` Wu Fengguang
  (?)
@ 2010-12-15 18:48         ` Richard Kennedy
  -1 siblings, 0 replies; 202+ messages in thread
From: Richard Kennedy @ 2010-12-15 18:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, 2010-12-14 at 21:59 +0800, Wu Fengguang wrote:
> Hi Richard,
> 
> On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > On Mon, 2010-12-13 at 22:46 +0800, Wu Fengguang wrote:
> > > plain text document attachment
> > > (writeback-speedup-per-bdi-threshold-ramp-up.patch)
> > > Reduce the dampening for the control system, yielding faster
> > > convergence.
> > > 
> > > Currently it converges at a snail's pace for slow devices (in order of
> > > minutes).  For really fast storage, the convergence speed should be fine.
> > > 
> > > It makes sense to make it reasonably fast for typical desktops.
> > > 
> > > After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> > > So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> > > 16GB mem, which seems reasonable.
> > > 
> > > $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> > > BdiDirtyThresh:            0 kB
> > > BdiDirtyThresh:       118748 kB
> > > BdiDirtyThresh:       214280 kB
> > > BdiDirtyThresh:       303868 kB
> > > BdiDirtyThresh:       376528 kB
> > > BdiDirtyThresh:       411180 kB
> > > BdiDirtyThresh:       448636 kB
> > > BdiDirtyThresh:       472260 kB
> > > BdiDirtyThresh:       490924 kB
> > > BdiDirtyThresh:       499596 kB
> > > BdiDirtyThresh:       507068 kB
> > > ...
> > > DirtyThresh:          530392 kB
> > > 
> > > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > CC: Richard Kennedy <richard@rsk.demon.co.uk>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:11.000000000 +0800
> > > @@ -145,7 +145,7 @@ static int calc_period_shift(void)
> > >  	else
> > >  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > >  				100;
> > > -	return 2 + ilog2(dirty_total - 1);
> > > +	return ilog2(dirty_total - 1) - 1;
> > >  }
> > >  
> > >  /*
> > > 
> > > 
> > Hi Fengguang,
> > 
> > I've been running my test set on your v3 series and generally it's
> > giving good results in line with the mainline kernel, with much less
> > variability and lower standard deviation of the results so it is much
> > more repeatable.
> 
> Glad to hear that, and thank you very much for trying it out!
> 
> > However, it doesn't seem to be honouring the background_dirty_threshold.
> 
> > The attached graph is from a simple fio write test of 400Mb on ext4.
> > All dirty pages are completely written in 15 seconds, but I expect to
> > see up to background_dirty_threshold pages staying dirty until the 30
> > second background task writes them out. So it is much too eager to write
> > back dirty pages.
>  
> This is interesting, and seems easy to root cause. When testing v4,
> would you help collect the following trace events?
> 
> echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
> echo 1 > /debug/tracing/events/writeback/balance_dirty_state/enable
> echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable
> 
> They'll have good opportunity to disclose the bug.
> 
> > As to the ramp up time, when writing to 2 disks at the same time I see
> > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > value after one of the write stops. So I think this could be speeded up
> > even more, at least on my setup.
> 
> I have the roughly same ramp up time on the 1-disk 3GB mem test:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
>  
> Given that it's the typical desktop, it does seem reasonable to speed
> it up further.
> 
> > I am just about to start testing v4 & will report anything interesting.
> 
> Thanks!
> 
> Fengguang

I just mailed the trace log to Fengguang; it is a bit too big to post
to this list. If anyone wants it, let me know and I'll mail it to them
directly.

I'm also seeing a write stall in some of my tests. When writing 400MB,
after about 6 seconds I see a few seconds during which no sectors are
reported written to sda and there are no pages under writeback, although
there are lots of dirty pages. (The graph I sent previously shows this
stall as well.)

regards
Richard
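
(A back-of-the-envelope reading of the one-liner in the quoted patch,
assuming the dampening period is 2^calc_period_shift() written pages;
that assumption is mine and not verified against the proportions code:

	old shift: 2 + ilog2(dirty_total - 1)
	new shift:     ilog2(dirty_total - 1) - 1

	The shift drops by 3, so the period shrinks by 2^3 = 8x and the
	per-bdi threshold should adapt roughly 8x faster, consistent with
	the "~10 seconds" vs. minutes figures quoted above.)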




^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 17/35] writeback: quit throttling when bdi dirty pages dropped low
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-16  5:17     ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-16  5:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

This patch seems optional and won't improve things noticeably.
Even if we break out of the loop, the task will quickly return to
balance_dirty_pages() as long as the bdi is dirty_exceeded. So I'd
like to drop this patch for now.

Thanks,
Fengguang

On Mon, Dec 13, 2010 at 10:47:03PM +0800, Wu, Fengguang wrote:
> Tests show that bdi_thresh may take minutes to ramp up on a typical
> desktop. The time should be improvable but cannot be eliminated totally.
> So when (background_thresh + dirty_thresh)/2 is reached and
> balance_dirty_pages() starts to throttle the task, it will suddenly find
> the (still low and ramping up) bdi_thresh is exceeded _excessively_. Here
> we definitely don't want to stall the task for one minute (when it's
> writing to USB stick). So introduce an alternative way to break out of
> the loop when the bdi dirty/write pages has dropped by a reasonable
> amount.
> 
> It will at least pause for one loop before trying to break out.
> 
> The break is designed mainly to help the single task case. The break
> threshold is time for writing 125ms data, so that when the task slept
> for MAX_PAUSE=200ms, it will have good chance to break out. For NFS
> there may be only 1-2 completions of large COMMIT per second, in which
> case the task may still get stuck for 1s.
> 
> Note that this opens the chance that during normal operation, a huge
> number of slow dirtiers writing to a really slow device might manage to
> outrun bdi_thresh. But the risk is pretty low.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> @@ -693,6 +693,7 @@ static void balance_dirty_pages(struct a
>  	long nr_dirty;
>  	long bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	long avg_dirty;  /* smoothed bdi_dirty */
> +	long bdi_prev_dirty = 0;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> @@ -749,6 +750,24 @@ static void balance_dirty_pages(struct a
>  
>  		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
>  
> +		/*
> +		 * bdi_thresh takes time to ramp up from the initial 0,
> +		 * especially for slow devices.
> +		 *
> +		 * It's possible that at the moment dirty throttling starts,
> +		 *	bdi_dirty = nr_dirty
> +		 *		  = (background_thresh + dirty_thresh) / 2
> +		 *		  >> bdi_thresh
> +		 * Then the task could be blocked for many seconds to flush all
> +		 * the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
> +		 * complementary way to break out of the loop when 125ms worth
> +		 * of dirty pages have been cleaned during our pause time.
> +		 */
> +		if (nr_dirty <= dirty_thresh &&
> +		    bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
> +			break;
> +		bdi_prev_dirty = bdi_dirty;
> +
>  		avg_dirty = bdi->avg_dirty;
>  		if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
>  			avg_dirty = bdi_dirty;
> 
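
For a concrete feel of the /8 in the quoted hunk, a rough calculation
(illustrative only; it assumes a ~60MB/s disk, 4KB pages, and that
bdi->write_bandwidth is tracked in pages per second):

	bdi->write_bandwidth ~= 60 MB/s ~= 15360 pages/s
	break threshold       = write_bandwidth / 8 ~= 1920 pages ~= 7.5 MB

	So the task leaves the loop once roughly 125ms (1/8 second) worth
	of dirty pages has been cleaned during its pause, matching the
	"125ms worth of dirty pages" comment in the patch.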

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 18/35] writeback: start background writeback earlier
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-16  5:37     ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-16  5:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
	linux-mm, linux-fsdevel, LKML

On Mon, Dec 13, 2010 at 10:47:04PM +0800, Wu, Fengguang wrote:
> It's possible for some one to suddenly eat lots of memory,
> leading to sudden drop of global dirty limit. So a dirtier
> task may get hard throttled immediately without some previous
> balance_dirty_pages() call to invoke background writeback.
> 
> In this case we need to check for background writeback earlier in the
> loop to avoid stucking the application for very long time. This was not
> a problem before the IO-less balance_dirty_pages() because it will try
> to write something and then break out of the loop regardless of the
> global limit.
> 
> Another scheme this check will help is, the dirty limit is too close to
> the background threshold, so that someone manages to jump directly into
> the pause threshold (background+dirty)/2.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:16.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-13 21:46:17.000000000 +0800
> @@ -748,6 +748,9 @@ static void balance_dirty_pages(struct a
>  				    bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
>  		bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
>  
>  		/*
> 

The above patch allows this simplification.
---
Subject: writeback: start background writeback earlier - handle laptop mode
Date: Wed Dec 15 20:15:54 CST 2010

The laptop mode handling can be simplified since we now kick background
writeback inside the balance_dirty_pages() loop when dirty_exceeded.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-15 20:14:33.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-15 20:15:39.000000000 +0800
@@ -891,8 +891,10 @@ pause:
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && dirty_exceeded) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
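
Put together, my reading of how balance_dirty_pages() behaves with both
changes applied (a sketch of the intent, not the literal resulting
code):

	/* inside the throttle loop: kick the flusher whenever it is idle;
	 * this also covers the old (laptop_mode && dirty_exceeded) case */
	if (unlikely(!writeback_in_progress(bdi)))
		bdi_start_background_writeback(bdi);

	/* after the loop: laptop mode leaves further writeout to the
	 * laptop mode timer; normal mode still starts background writeout
	 * once nr_reclaimable exceeds background_thresh */
	if (laptop_mode)
		return;
	if (nr_reclaimable > background_thresh)
		bdi_start_background_writeback(bdi);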
 

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 22/35] writeback: trace global dirty page states
  2010-12-13 14:47   ` Wu Fengguang
@ 2010-12-17  2:19     ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17  2:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Mon, Dec 13, 2010 at 10:47:08PM +0800, Wu, Fengguang wrote:

> +	TP_fast_assign(
> +		strlcpy(__entry->bdi,
> +			dev_name(mapping->backing_dev_info->dev), 32);
> +		__entry->ino			= mapping->host->i_ino;

I got an oops against the above line on shmem. It can be fixed by the
patch below, but I'm still not 100% confident...

Thanks,
Fengguang
---
Subject: writeback fix dereferencing NULL shmem mapping->host
Date: Thu Dec 16 22:22:00 CST 2010

The oops happens when doing "cp /proc/vmstat /dev/shm". It seems to be
triggered on accessing host->i_ino, since the offset of i_ino is exactly
0x50. However I'm afraid the problem is not fully understood:

1) it's not normal for tmpfs to have mapping->host == NULL

2) I tried removing the dereference as in the diff below; however, it
   didn't stop the oops. This is very weird.

TRACE_EVENT balance_dirty_state:

 	TP_fast_assign(
 		strlcpy(__entry->bdi,
 			dev_name(mapping->backing_dev_info->dev), 32);
-		__entry->ino			= mapping->host->i_ino;
 		__entry->nr_dirty		= nr_dirty;
 		__entry->nr_writeback		= nr_writeback;
 		__entry->nr_unstable		= nr_unstable;

[  337.018477] EXT3-fs (sda8): mounted filesystem with writeback data mode
[  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[  388.127057] IP: [<ffffffff811a8387>] ftrace_raw_event_balance_dirty_state+0x97/0x130
[  388.127506] PGD b507e067 PUD b1474067 PMD 0
[  388.127858] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  388.128218] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler
[  388.128737] CPU 0
[  388.128846] Modules linked in:
[  388.129149]
[  388.129279] Pid: 4222, comm: cp Not tainted 2.6.37-rc5+ #361 DX58SO/
[  388.129625] RIP: 0010:[<ffffffff811a8387>]  [<ffffffff811a8387>] ftrace_raw_event_balance_dirty_state+0x97/0x130
[  388.130165] RSP: 0018:ffff8800a9ab7a98  EFLAGS: 00010202
[  388.130443] RAX: 0000000000000000 RBX: ffffffff81fc3c68 RCX: 0000000000001000
[  388.130792] RDX: 0000000000000020 RSI: 0000000000000282 RDI: ffff8800a99a74a0
[  388.131141] RBP: ffff8800a9ab7b08 R08: 000000000000001a R09: 0000000000000480
[  388.131490] R10: ffffffff81fdd660 R11: 0000000000000001 R12: 0000000000000000
[  388.131838] R13: ffff8800a99a7494 R14: ffff8800a99a7490 R15: 0000000000010ebf
[  388.132189] FS:  00007fc4b1f217a0(0000) GS:ffff8800b7400000(0000) knlGS:0000000000000000
[  388.132606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  388.132901] CR2: 0000000000000050 CR3: 00000000b268a000 CR4: 00000000000006f0
[  388.133250] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  388.133598] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  388.133948] Process cp (pid: 4222, threadinfo ffff8800a9ab6000, task ffff8800b2e09900)
[  388.134359] Stack:
[  388.134508]  ffff8800a9ab7ac8 0000000000000002 0000000000000000 0000000000000005
[  388.135049]  ffff8800b1757320 0000000000000282 ffff8800a9ab7ae8 ffff8800b5f66cc0
[  388.135590]  ffff8800b1757178 0000000000021d7f ffff8800a9a7e350 0000000000010ebf
[  388.136132] Call Trace:
[  388.136303]  [<ffffffff81137f00>] balance_dirty_pages_ratelimited_nr+0x6a0/0x7f0
[  388.136698]  [<ffffffff81141c37>] ? shmem_getpage+0x777/0xa80
[  388.136996]  [<ffffffff8112c575>] generic_file_buffered_write+0x1f5/0x290
[  388.137333]  [<ffffffff8108c026>] ? current_fs_time+0x16/0x60
[  388.137631]  [<ffffffff81a815c0>] ? mutex_lock_nested+0x280/0x350
[  388.137940]  [<ffffffff8112e394>] __generic_file_aio_write+0x244/0x450
[  388.138267]  [<ffffffff81a815d2>] ? mutex_lock_nested+0x292/0x350
[  388.138576]  [<ffffffff8112e5f8>] ? generic_file_aio_write+0x58/0xd0
[  388.138896]  [<ffffffff8112e5f8>] ? generic_file_aio_write+0x58/0xd0
[  388.139216]  [<ffffffff8112e60b>] generic_file_aio_write+0x6b/0xd0
[  388.139531]  [<ffffffff81182aaa>] do_sync_write+0xda/0x120
[  388.139819]  [<ffffffff810bb55d>] ? lock_release_holdtime+0x3d/0x180
[  388.140139]  [<ffffffff81a8397b>] ? _raw_spin_unlock+0x2b/0x40
[  388.140440]  [<ffffffff811d839e>] ? proc_reg_read+0x8e/0xc0
[  388.140731]  [<ffffffff8118322e>] vfs_write+0xce/0x190
[  388.141004]  [<ffffffff81183564>] sys_write+0x54/0x90
[  388.141274]  [<ffffffff8103af42>] system_call_fastpath+0x16/0x1b
[  388.141579] Code: 84 85 00 00 00 48 89 c7 e8 27 e5 f5 ff 48 8b 55 b0 49 89 c5 48 8b 82 f8 00 00 00 49 8d 7d 0c 48 8b 80 08 04 00 00 ba 20 00 00 00 <48> 8b 70 50 48 85 f6 48 0f 44 70 10 e8 58 f8 29 00 48 8b 45 a8
[  388.144899] RIP  [<ffffffff811a8387>] ftrace_raw_event_balance_dirty_state+0x97/0x130
[  388.145346]  RSP <ffff8800a9ab7a98>
[  388.145555] CR2: 0000000000000050
[  388.146039] ---[ end trace d824f7aad3debcd9 ]---

CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-12-17 09:30:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-17 09:31:05.000000000 +0800
@@ -907,6 +907,9 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!mapping_cap_writeback_dirty(mapping))
+		return;
+
 	current->nr_dirtied += nr_pages_dirtied;
 
 	if (unlikely(!current->nr_dirtied_pause))
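
For context, the new check comes down to the backing_dev_info capability
bits; roughly (paraphrased from memory of that era's
include/linux/backing-dev.h, so treat the exact form as an assumption):

	/* false when the bdi is flagged BDI_CAP_NO_WRITEBACK, as shmem/tmpfs is */
	static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
	{
		return bdi_cap_writeback_dirty(mapping->backing_dev_info);
	}

With the early return, balance_dirty_pages_ratelimited_nr() no longer
reaches balance_dirty_pages() (and the tracepoint that dereferences
mapping->host) for tmpfs writes at all.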

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 22/35] writeback: trace global dirty page states
  2010-12-17  2:19     ` Wu Fengguang
@ 2010-12-17  3:11       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17  3:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Fri, Dec 17, 2010 at 10:19:34AM +0800, Wu Fengguang wrote:
> On Mon, Dec 13, 2010 at 10:47:08PM +0800, Wu, Fengguang wrote:
> 
> > +	TP_fast_assign(
> > +		strlcpy(__entry->bdi,
> > +			dev_name(mapping->backing_dev_info->dev), 32);
> > +		__entry->ino			= mapping->host->i_ino;
> 
> I got an oops against the above line on shmem. Can be fixed by the
> below patch, but still not 100% confident..

btw, here is a cleanup of the tracepoint.

Thanks,
Fengguang
---
Subject: writeback: simplify and rename tracepoint balance_dirty_state to global_dirty_state
Date: Fri Dec 17 10:37:35 CST 2010

Make it a cleaner interface, and also track the background flusher
when it calls over_bground_thresh() to check the global limits.

The removed information could go into tracepoint balance_dirty_pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   37 +++++++----------------------
 mm/page-writeback.c              |   16 +++---------
 2 files changed, 13 insertions(+), 40 deletions(-)

--- linux-next.orig/include/trace/events/writeback.h	2010-12-17 11:05:08.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-12-17 11:06:20.000000000 +0800
@@ -149,60 +149,41 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
-TRACE_EVENT(balance_dirty_state,
+TRACE_EVENT(global_dirty_state,
 
-	TP_PROTO(struct address_space *mapping,
-		 unsigned long nr_dirty,
-		 unsigned long nr_writeback,
-		 unsigned long nr_unstable,
-		 unsigned long background_thresh,
+	TP_PROTO(unsigned long background_thresh,
 		 unsigned long dirty_thresh
 	),
 
-	TP_ARGS(mapping,
-		nr_dirty,
-		nr_writeback,
-		nr_unstable,
-		background_thresh,
+	TP_ARGS(background_thresh,
 		dirty_thresh
 	),
 
 	TP_STRUCT__entry(
-		__array(char,		bdi, 32)
-		__field(unsigned long,	ino)
 		__field(unsigned long,	nr_dirty)
 		__field(unsigned long,	nr_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
-		__field(unsigned long,	task_dirtied_pause)
 	),
 
 	TP_fast_assign(
-		strlcpy(__entry->bdi,
-			dev_name(mapping->backing_dev_info->dev), 32);
-		__entry->ino			= mapping->host->i_ino;
-		__entry->nr_dirty		= nr_dirty;
-		__entry->nr_writeback		= nr_writeback;
-		__entry->nr_unstable		= nr_unstable;
+		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
 		__entry->background_thresh	= background_thresh;
 		__entry->dirty_thresh		= dirty_thresh;
-		__entry->task_dirtied_pause	= current->nr_dirtied_pause;
 	),
 
-	TP_printk("bdi %s: dirty=%lu wb=%lu unstable=%lu "
-		  "bg_thresh=%lu thresh=%lu gap=%ld "
-		  "poll_thresh=%lu ino=%lu",
-		  __entry->bdi,
+	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+		  "bg_thresh=%lu thresh=%lu gap=%ld",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
 		  __entry->background_thresh,
 		  __entry->dirty_thresh,
 		  __entry->dirty_thresh - __entry->nr_dirty -
-		  __entry->nr_writeback - __entry->nr_unstable,
-		  __entry->task_dirtied_pause,
-		  __entry->ino
+		  __entry->nr_writeback - __entry->nr_unstable
 	)
 );
 
--- linux-next.orig/mm/page-writeback.c	2010-12-17 11:05:08.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-17 11:05:09.000000000 +0800
@@ -418,6 +418,7 @@ void global_dirty_limits(unsigned long *
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+	trace_global_dirty_state(background, dirty);
 }
 
 /**
@@ -712,21 +713,12 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY);
-		bdi_dirty = global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
-		trace_balance_dirty_state(mapping,
-					  nr_reclaimable,
-					  nr_dirty,
-					  bdi_dirty,
-					  background_thresh,
-					  dirty_thresh);
-		nr_reclaimable += bdi_dirty;
-		nr_dirty += nr_reclaimable;
-
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
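
With the rename, enabling the event should follow the same pattern as
the other writeback tracepoints mentioned earlier in the thread (the
path is assumed by analogy with the balance_dirty_state one):

	echo 1 > /debug/tracing/events/writeback/global_dirty_state/enable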

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 22/35] writeback: trace global dirty page states
@ 2010-12-17  3:11       ` Wu Fengguang
  0 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17  3:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Fri, Dec 17, 2010 at 10:19:34AM +0800, Wu Fengguang wrote:
> On Mon, Dec 13, 2010 at 10:47:08PM +0800, Wu, Fengguang wrote:
> 
> > +	TP_fast_assign(
> > +		strlcpy(__entry->bdi,
> > +			dev_name(mapping->backing_dev_info->dev), 32);
> > +		__entry->ino			= mapping->host->i_ino;
> 
> I got an oops against the above line on shmem. Can be fixed by the
> below patch, but still not 100% confident..

btw, here is a cleanup of the tracepoint.

Thanks,
Fengguang
---
Subject: writeback: simplify and rename tracepoint balance_dirty_state to global_dirty_state
Date: Fri Dec 17 10:37:35 CST 2010

Make it a more clean interface, and also track the background flusher
when it calls over_bground_thresh() to check the global limits.

The removed information could go into tracepoint balance_dirty_pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   37 +++++++----------------------
 mm/page-writeback.c              |   16 +++---------
 2 files changed, 13 insertions(+), 40 deletions(-)

--- linux-next.orig/include/trace/events/writeback.h	2010-12-17 11:05:08.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-12-17 11:06:20.000000000 +0800
@@ -149,60 +149,41 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
-TRACE_EVENT(balance_dirty_state,
+TRACE_EVENT(global_dirty_state,
 
-	TP_PROTO(struct address_space *mapping,
-		 unsigned long nr_dirty,
-		 unsigned long nr_writeback,
-		 unsigned long nr_unstable,
-		 unsigned long background_thresh,
+	TP_PROTO(unsigned long background_thresh,
 		 unsigned long dirty_thresh
 	),
 
-	TP_ARGS(mapping,
-		nr_dirty,
-		nr_writeback,
-		nr_unstable,
-		background_thresh,
+	TP_ARGS(background_thresh,
 		dirty_thresh
 	),
 
 	TP_STRUCT__entry(
-		__array(char,		bdi, 32)
-		__field(unsigned long,	ino)
 		__field(unsigned long,	nr_dirty)
 		__field(unsigned long,	nr_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
-		__field(unsigned long,	task_dirtied_pause)
 	),
 
 	TP_fast_assign(
-		strlcpy(__entry->bdi,
-			dev_name(mapping->backing_dev_info->dev), 32);
-		__entry->ino			= mapping->host->i_ino;
-		__entry->nr_dirty		= nr_dirty;
-		__entry->nr_writeback		= nr_writeback;
-		__entry->nr_unstable		= nr_unstable;
+		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
 		__entry->background_thresh	= background_thresh;
 		__entry->dirty_thresh		= dirty_thresh;
-		__entry->task_dirtied_pause	= current->nr_dirtied_pause;
 	),
 
-	TP_printk("bdi %s: dirty=%lu wb=%lu unstable=%lu "
-		  "bg_thresh=%lu thresh=%lu gap=%ld "
-		  "poll_thresh=%lu ino=%lu",
-		  __entry->bdi,
+	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+		  "bg_thresh=%lu thresh=%lu gap=%ld",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
 		  __entry->background_thresh,
 		  __entry->dirty_thresh,
 		  __entry->dirty_thresh - __entry->nr_dirty -
-		  __entry->nr_writeback - __entry->nr_unstable,
-		  __entry->task_dirtied_pause,
-		  __entry->ino
+		  __entry->nr_writeback - __entry->nr_unstable
 	)
 );
 
--- linux-next.orig/mm/page-writeback.c	2010-12-17 11:05:08.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-17 11:05:09.000000000 +0800
@@ -418,6 +418,7 @@ void global_dirty_limits(unsigned long *
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+	trace_global_dirty_state(background, dirty);
 }
 
 /**
@@ -712,21 +713,12 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY);
-		bdi_dirty = global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
-		trace_balance_dirty_state(mapping,
-					  nr_reclaimable,
-					  nr_dirty,
-					  bdi_dirty,
-					  background_thresh,
-					  dirty_thresh);
-		nr_reclaimable += bdi_dirty;
-		nr_dirty += nr_reclaimable;
-
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
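
As a quick illustration of what the renamed tracepoint emits, here is a
minimal userspace sketch of the line the TP_printk above produces. The
page counts are made up for illustration; only the format string and
the "gap" arithmetic are taken from the tracepoint.

#include <stdio.h>

int main(void)
{
	/* illustrative numbers, not captured trace output */
	unsigned long nr_dirty = 12000, nr_writeback = 2000, nr_unstable = 0;
	unsigned long background_thresh = 25000, dirty_thresh = 50000;

	printf("dirty=%lu writeback=%lu unstable=%lu "
	       "bg_thresh=%lu thresh=%lu gap=%ld\n",
	       nr_dirty, nr_writeback, nr_unstable,
	       background_thresh, dirty_thresh,
	       (long)(dirty_thresh - nr_dirty - nr_writeback - nr_unstable));
	return 0;
}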


^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 22/35] writeback: trace global dirty page states
  2010-12-17  2:19     ` Wu Fengguang
@ 2010-12-17  6:52       ` Hugh Dickins
  -1 siblings, 0 replies; 202+ messages in thread
From: Hugh Dickins @ 2010-12-17  6:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Fri, 17 Dec 2010, Wu Fengguang wrote:
> On Mon, Dec 13, 2010 at 10:47:08PM +0800, Wu, Fengguang wrote:
> 
> > +	TP_fast_assign(
> > +		strlcpy(__entry->bdi,
> > +			dev_name(mapping->backing_dev_info->dev), 32);
> > +		__entry->ino			= mapping->host->i_ino;
> 
> I got an oops against the above line on shmem. Can be fixed by the
> below patch, but still not 100% confident..
> 
> Thanks,
> Fengguang
> ---
> Subject: writeback fix dereferencing NULL shmem mapping->host
> Date: Thu Dec 16 22:22:00 CST 2010
> 
> The oops happens when doing "cp /proc/vmstat /dev/shm". It seems to be
> triggered on accessing host->i_ino, since the offset of i_ino is exactly
> 0x50. However I'm afraid the problem is not fully understood
> 
> 1) it's not normal that tmpfs will have mapping->host == NULL
> 
> 2) I tried removing the dereference as the below diff, however it
>    didn't stop the oops. This is very weird.
> 
> TRACE_EVENT balance_dirty_state:
> 
>  	TP_fast_assign(
>  		strlcpy(__entry->bdi,
>  			dev_name(mapping->backing_dev_info->dev), 32);

I believe this line above is actually the problem: you can imagine that
tmpfs leaves backing_dev_info->dev NULL, and dev_name() appears to
access dev->init_name at 64-bit offset 0x50 down struct device.

> -		__entry->ino			= mapping->host->i_ino;
>  		__entry->nr_dirty		= nr_dirty;
>  		__entry->nr_writeback		= nr_writeback;
>  		__entry->nr_unstable		= nr_unstable;
...
> 
> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>

I prefer hughd@google.com, but the tiscali address survived unexpectedly.

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-17 09:30:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-17 09:31:05.000000000 +0800
> @@ -907,6 +907,9 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> +	if (!mapping_cap_writeback_dirty(mapping))
> +		return;
> +
>  	current->nr_dirtied += nr_pages_dirtied;
>  
>  	if (unlikely(!current->nr_dirtied_pause))

That would not really be the right patch to fix your oops, but it
or something like it would be a very sensible patch in its own right:
looking back through old patches I never got around to sending in,
I can see I had a very similar one two years ago, to save wasting
time on dirty page accounting here when it's inappropriate.
Though mine was testing !mapping_cap_account_dirty(mapping).

Hugh

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 22/35] writeback: trace global dirty page states
  2010-12-17  6:52       ` Hugh Dickins
@ 2010-12-17  9:31         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17  9:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Fri, Dec 17, 2010 at 02:52:50PM +0800, Hugh Dickins wrote:
> On Fri, 17 Dec 2010, Wu Fengguang wrote:
> > On Mon, Dec 13, 2010 at 10:47:08PM +0800, Wu, Fengguang wrote:
> > 
> > > +	TP_fast_assign(
> > > +		strlcpy(__entry->bdi,
> > > +			dev_name(mapping->backing_dev_info->dev), 32);
> > > +		__entry->ino			= mapping->host->i_ino;
> > 
> > I got an oops against the above line on shmem. Can be fixed by the
> > below patch, but still not 100% confident..
> > 
> > Thanks,
> > Fengguang
> > ---
> > Subject: writeback fix dereferencing NULL shmem mapping->host
> > Date: Thu Dec 16 22:22:00 CST 2010
> > 
> > The oops happens when doing "cp /proc/vmstat /dev/shm". It seems to be
> > triggered on accessing host->i_ino, since the offset of i_ino is exactly
> > 0x50. However I'm afraid the problem is not fully understood
> > 
> > 1) it's not normal that tmpfs will have mapping->host == NULL
> > 
> > 2) I tried removing the dereference as the below diff, however it
> >    didn't stop the oops. This is very weird.
> > 
> > TRACE_EVENT balance_dirty_state:
> > 
> >  	TP_fast_assign(
> >  		strlcpy(__entry->bdi,
> >  			dev_name(mapping->backing_dev_info->dev), 32);
> 
> I believe this line above is actually the problem: you can imagine that
> tmpfs leaves backing_dev_info->dev NULL, and dev_name() appears to

Ah, I didn't notice that obvious fact in shmem_backing_dev_info..

> access dev->init_name at 64-bit offset 0x50 down struct device.

And it's such a coincidence that the two lines accessed different
struct members, both at offset 0x50 :)
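
To make the coincidence concrete, here is a tiny userspace sketch. The
struct layout is assumed purely for illustration (it is not the
kernel's struct device or struct inode); the point is just that a
member at offset 0x50 behind a NULL pointer faults at virtual address
0x50, matching the "NULL pointer dereference at 0000000000000050" oops.

#include <stddef.h>
#include <stdio.h>

struct dev_like {
	char pad[0x50];
	const char *init_name;	/* assumed to sit at offset 0x50 */
};

int main(void)
{
	/* the address a NULL->init_name load would touch */
	printf("faulting address: %#zx\n",
	       offsetof(struct dev_like, init_name));
	return 0;
}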

> > -		__entry->ino			= mapping->host->i_ino;
> >  		__entry->nr_dirty		= nr_dirty;
> >  		__entry->nr_writeback		= nr_writeback;
> >  		__entry->nr_unstable		= nr_unstable;
> ...
> > 
> > CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> 
> I prefer hughd@google.com, but the tiscali address survived unexpectedly.

OK, just updated my alias db.

> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-12-17 09:30:11.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-12-17 09:31:05.000000000 +0800
> > @@ -907,6 +907,9 @@ void balance_dirty_pages_ratelimited_nr(
> >  {
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> > +	if (!mapping_cap_writeback_dirty(mapping))
> > +		return;
> > +
> >  	current->nr_dirtied += nr_pages_dirtied;
> >  
> >  	if (unlikely(!current->nr_dirtied_pause))
> 
> That would not really be the right patch to fix your oops, but it

Then it will also avoid the oops in another tracepoint, balance_dirty_pages.
([PATCH 21/35] writeback: trace balance_dirty_pages() in this series)

I skipped the backing_dev_info->dev check partly because it's also
referenced in tracepoint balance_dirty_pages. So I did this cure-all
change that makes sense in itself :)

> or something like it would be a very sensible patch in its own right:
> looking back through old patches I never got around to sending in,
> I can see I had a very similar one two years ago, to save wasting
> time on dirty page accounting here when it's inappropriate.

It's a pity you didn't submit it.

> Though mine was testing !mapping_cap_account_dirty(mapping).

Sorry, I didn't check whether to use mapping_cap_writeback_dirty() or
mapping_cap_account_dirty() -- I just picked one of them at random.

Some double checking shows that the end results are the same for now:
all the relevant places set both flags at the same time via
BDI_CAP_NO_ACCT_AND_WRITEBACK. However it does look more sane to use
bdi_cap_account_dirty(bdi). I'll switch to it. Thank you!
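
As a side note on why the two helpers coincide today, here is a small
userspace model. The bit values are assumptions for illustration, not
the kernel's definitions; the point is only that a bdi declaring the
combined BDI_CAP_NO_ACCT_AND_WRITEBACK capability fails both tests at
once.

#include <stdbool.h>
#include <stdio.h>

#define BDI_CAP_NO_ACCT_DIRTY	0x1	/* assumed value */
#define BDI_CAP_NO_WRITEBACK	0x2	/* assumed value */
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
	(BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK)

struct bdi_model { unsigned int capabilities; };

static bool cap_account_dirty(const struct bdi_model *bdi)
{
	return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
}

static bool cap_writeback_dirty(const struct bdi_model *bdi)
{
	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
}

int main(void)
{
	struct bdi_model shmem_like = { BDI_CAP_NO_ACCT_AND_WRITEBACK };

	printf("account_dirty=%d writeback_dirty=%d\n",
	       cap_account_dirty(&shmem_like),
	       cap_writeback_dirty(&shmem_like));	/* both print 0 */
	return 0;
}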

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-17  6:52       ` Hugh Dickins
@ 2010-12-17 11:21         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17 11:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

It also prevents

[  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050

in the balance_dirty_pages tracepoint, which will call

	dev_name(mapping->backing_dev_info->dev)

but shmem_backing_dev_info.dev is NULL.

CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-12-17 19:09:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-17 19:09:22.000000000 +0800
@@ -899,6 +899,9 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	current->nr_dirtied += nr_pages_dirtied;
 
 	if (unlikely(!current->nr_dirtied_pause))

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time
  2010-12-15 18:48         ` Richard Kennedy
  (?)
  (?)
@ 2010-12-17 13:07         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-17 13:07 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 6332 bytes --]

On Thu, Dec 16, 2010 at 02:48:29AM +0800, Richard Kennedy wrote:
> On Tue, 2010-12-14 at 21:59 +0800, Wu Fengguang wrote:
> > On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > > Hi Fengguang,
> > > 
> > > I've been running my test set on your v3 series and generally it's
> > > giving good results in line with the mainline kernel, with much less
> > > variability and lower standard deviation of the results so it is much
> > > more repeatable.
> > 
> > Glad to hear that, and thank you very much for trying it out!
> > 
> > > However, it doesn't seem to be honouring the background_dirty_threshold.
> > 
> > > The attached graph is from a simple fio write test of 400Mb on ext4.
> > > All dirty pages are completely written in 15 seconds, but I expect to
> > > see up to background_dirty_threshold pages staying dirty until the 30
> > > second background task writes them out. So it is much too eager to write
> > > back dirty pages.
> >  
> > This is interesting, and seems easy to root cause. When testing v4,
> > would you help collect the following trace events?
> > 
> > echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
> > echo 1 > /debug/tracing/events/writeback/balance_dirty_state/enable
> > echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable
> > 
> > They'll have good opportunity to disclose the bug.
> > 
> > > As to the ramp up time, when writing to 2 disks at the same time I see
> > > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > > value after one of the writes stops. So I think this could be speeded up
> > > even more, at least on my setup.
> > 
> > I have the roughly same ramp up time on the 1-disk 3GB mem test:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
> >  
> > Given that it's the typical desktop, it does seem reasonable to speed
> > it up further.
> > 
> > > I am just about to start testing v4 & will report anything interesting.
> > 
> > Thanks!
> > 
> > Fengguang
> 
> I just mailed the trace log to Fengguang, it is a bit big to post to
> this list. If anyone wants it, let me know and I'll mail to them
> directly.
> 
> I'm also seeing a write stall in some of my tests. When writing 400Mb
> after about 6 seconds I see a few seconds when there are no reported
> sectors written to sda & there are no pages under writeback although
> there are lots of dirty pages. ( the graph I sent previously shows this
> stall as well )

I managed to reproduce your workload, see the attached graphs. They
represent two runs of the following fio job. Obviously the results
are very reproducible.

        [zero]
        size=400m
        rw=write
        pre_read=1
        ioengine=mmap

Here is the trace data for the first graph. I'll explain how every
single write is triggered. Vanilla kernels should have the same
behavior.

background threshold exceeded, so background flush is started
-------------------------------------------------------------
       flush-8:0-2662  [005]    18.759459: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=544 wrote=16385 to_write=-1 index=1
       flush-8:0-2662  [000]    19.941272: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1732 wrote=16385 to_write=-1 index=16386
       flush-8:0-2662  [000]    20.162497: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1952 wrote=4097 to_write=-1 index=32771


fio completes data population and does something like fsync()
Note that the dirty age is not reset by fsync().
-------------------------------------------------------------
           <...>-2637  [000]    25.364145: fdatawrite_range:              fio: bdi=8:0 ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES start=0 end=9223372036854775807 sync=1 wrote=65533 skipped=0
           <...>-2637  [004]    26.492765: fdatawrite_range:              fio: bdi=8:0 ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES start=0 end=9223372036854775807 sync=0 wrote=0 skipped=0


fio starts "rw=write", and triggered background flush when
background threshold is exceeded
----------------------------------------------------------
       flush-8:0-2662  [000]    33.277084: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_PAGES age=15112 wrote=16385 to_write=-1 index=1
       flush-8:0-2662  [000]    34.486721: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=16324 wrote=16385 to_write=-1 index=16386
       flush-8:0-2662  [000]    34.942939: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=16784 wrote=8193 to_write=-1 index=32771


5 seconds later, kupdate flush starts to work on expired inodes in
b_io *as well as* whatever inodes are already in the b_more_io
list.  Unfortunately inode 131 was moved to b_more_io in the previous
background flush and has been sitting there ever since.
---------------------------------------------------------------------
       flush-8:0-2662  [004]    39.951920: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=21808 wrote=16385 to_write=-1 index=40964
       flush-8:0-2662  [000]    40.784427: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=22644 wrote=16385 to_write=-1 index=57349
       flush-8:0-2662  [000]    41.840671: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=23704 wrote=8193 to_write=-1 index=73734
       flush-8:0-2662  [004]    42.845739: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=24712 wrote=8193 to_write=-1 index=81927
       flush-8:0-2662  [004]    43.309379: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=25180 wrote=8193 to_write=-1 index=90120
       flush-8:0-2662  [000]    43.547443: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC age=25416 wrote=4088 to_write=12296 index=0


This may be a bit surprising, but should not be a big problem. After
all, the vm.dirty_expire_centisecs=30s merely says that dirty inodes
will be put to IO _within_ 35s. The kernel still has some freedom
to start writeback earlier than the deadline, or even to miss the
deadline when IO is too busy.
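
To put a number on the "within 35s" claim, here is a small sketch of
the timing argument (the 5s wakeup interval and the 30s expire age
follow the values discussed above; the loop is an illustration, not
kernel code):

#include <stdio.h>

int main(void)
{
	const int wakeup_interval = 5;	/* kupdate wakes every 5s */
	const int expire_age = 30;	/* vm.dirty_expire_centisecs = 30s */
	const int dirtied_at = 0;	/* inode dirtied at t=0 */

	for (int now = wakeup_interval; ; now += wakeup_interval) {
		if (now - dirtied_at > expire_age) {
			/* worst case: t=35s for an inode dirtied at t=0 */
			printf("inode queued for writeback at t=%ds\n", now);
			break;
		}
	}
	return 0;
}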

Thanks,
Fengguang

[-- Attachment #2: global-dirty-state.png --]
[-- Type: image/png, Size: 74676 bytes --]

[-- Attachment #3: global-dirty-state.png --]
[-- Type: image/png, Size: 74622 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-17 11:21         ` Wu Fengguang
@ 2010-12-17 14:21           ` Rik van Riel
  -1 siblings, 0 replies; 202+ messages in thread
From: Rik van Riel @ 2010-12-17 14:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Hugh Dickins, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Peter Zijlstra, Mel Gorman, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On 12/17/2010 06:21 AM, Wu Fengguang wrote:
> This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
>
> It also prevents
>
> [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
>
> in the balance_dirty_pages tracepoint, which will call
>
> 	dev_name(mapping->backing_dev_info->dev)
>
> but shmem_backing_dev_info.dev is NULL.
>
> CC: Hugh Dickins<hughd@google.com>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-17 11:21         ` Wu Fengguang
@ 2010-12-17 15:34           ` Minchan Kim
  -1 siblings, 0 replies; 202+ messages in thread
From: Minchan Kim @ 2010-12-17 15:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Hugh Dickins, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, linux-mm, linux-fsdevel, LKML

On Fri, Dec 17, 2010 at 8:21 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
>
> It also prevents
>
> [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
>
> in the balance_dirty_pages tracepoint, which will call
>
>        dev_name(mapping->backing_dev_info->dev)
>
> but shmem_backing_dev_info.dev is NULL.
>
> CC: Hugh Dickins <hughd@google.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Is it material for -stable?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-17 15:34           ` Minchan Kim
@ 2010-12-17 15:42             ` Minchan Kim
  -1 siblings, 0 replies; 202+ messages in thread
From: Minchan Kim @ 2010-12-17 15:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Hugh Dickins, Andrew Morton, Jan Kara, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Greg Thelen, linux-mm, linux-fsdevel, LKML

On Sat, Dec 18, 2010 at 12:34 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 8:21 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
>>
>> It also prevents
>>
>> [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
>>
>> in the balance_dirty_pages tracepoint, which will call
>>
>>        dev_name(mapping->backing_dev_info->dev)
>>
>> but shmem_backing_dev_info.dev is NULL.
>>
>> CC: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>
> Is it material for -stable?

No. balance_dirty_pages tracepoint is new. :)

>
> --
> Kind regards,
> Minchan Kim
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-17 11:21         ` Wu Fengguang
@ 2010-12-21  5:59           ` Hugh Dickins
  -1 siblings, 0 replies; 202+ messages in thread
From: Hugh Dickins @ 2010-12-21  5:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Fri, 17 Dec 2010, Wu Fengguang wrote:

> This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
> 
> It also prevents
> 
> [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> 
> in the balance_dirty_pages tracepoint, which will call
> 
> 	dev_name(mapping->backing_dev_info->dev)
> 
> but shmem_backing_dev_info.dev is NULL.
> 
> CC: Hugh Dickins <hughd@google.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Whilst I do like this change, and I do think it's the right thing to do
(given that the bdi has explicitly opted out of what it then got into),
I've a sneaking feeling that something somewhere may show a regression
from it.  IIRC, there were circumstances in which it actually did
(inadvertently) end up throttling the tmpfs writing - if there were
too many dirty non-tmpfs pages around??

What am I saying?!  I think I'm asking you to look more closely at what
actually used to happen, and be more explicit about the behavior you're
stopping here - although the patch is mainly code optimization, there
is some functional change I think.  (You do mention throttling on
tmpfs/ramfs, but the way it worked out wasn't straightforward.)

I'd better not burble on for a third paragraph!

Hugh

> ---
>  mm/page-writeback.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-17 19:09:19.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-17 19:09:22.000000000 +0800
> @@ -899,6 +899,9 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> +	if (!bdi_cap_account_dirty(bdi))
> +		return;
> +
>  	current->nr_dirtied += nr_pages_dirtied;
>  
>  	if (unlikely(!current->nr_dirtied_pause))

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-21  5:59           ` Hugh Dickins
@ 2010-12-21  9:39             ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2010-12-21  9:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, Dec 21, 2010 at 01:59:46PM +0800, Hugh Dickins wrote:
> On Fri, 17 Dec 2010, Wu Fengguang wrote:
> 
> > This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
> > 
> > It also prevents
> > 
> > [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> > 
> > in the balance_dirty_pages tracepoint, which will call
> > 
> > 	dev_name(mapping->backing_dev_info->dev)
> > 
> > but shmem_backing_dev_info.dev is NULL.
> > 
> > CC: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> Whilst I do like this change, and I do think it's the right thing to do
> (given that the bdi has explicitly opted out of what it then got into),

Thanks.

> I've a sneaking feeling that something somewhere may show a regression
> from it.  IIRC, there were circumstances in which it actually did
> (inadvertently) end up throttling the tmpfs writing - if there were
> too many dirty non-tmpfs pages around??

Good point (that I missed!).

Here are the findings after double checking.

As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the per-bdi dirty
page counts and the per-bdi threshold will be 0 for tmpfs/ramfs.
Hence this test will always evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

As for 2.6.37, someone complained that the current logic does not
allow the users to set vm.dirty_ratio=0.  So the to-be-released 2.6.37
will have this change (commit 4cbec4c8b9)

@@ -542,8 +536,8 @@ static void balance_dirty_pages(struct address_space *mapping,
                 * the last resort safeguard.
                 */
                dirty_exceeded =
-                       (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
-                       || (nr_reclaimable + nr_writeback >= dirty_thresh);
+                       (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
+                       || (nr_reclaimable + nr_writeback > dirty_thresh);

So for 2.6.37 it will behave differently for tmpfs/ramfs: it will
never get throttled unless the global dirty threshold is exceeded,
which is very unlikely to happen (once it happens, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
for a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly bring help to any workload because
of its "either no throttling, or get throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.
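
To see the difference in one place, here is a minimal userspace model
of the two comparisons quoted above (it is only a sketch of those two
expressions, not of the full balance_dirty_pages() logic; the global
numbers are made up):

#include <stdbool.h>
#include <stdio.h>

static bool exceeded(unsigned long bdi_dirty, unsigned long bdi_thresh,
		     unsigned long dirty, unsigned long thresh, bool strict)
{
	return strict ? (bdi_dirty > bdi_thresh) || (dirty > thresh)
		      : (bdi_dirty >= bdi_thresh) || (dirty >= thresh);
}

int main(void)
{
	/* tmpfs/ramfs bdi: the per-bdi numbers are all 0;
	 * the global side is assumed to be below its limit */
	printf("2.6.36-style >= test: %d\n", exceeded(0, 0, 1000, 40000, false));
	printf("2.6.37-style >  test: %d\n", exceeded(0, 0, 1000, 40000, true));
	return 0;
}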

> What am I saying?!  I think I'm asking you to look more closely at what
> actually used to happen, and be more explicit about the behavior you're
> stopping here - although the patch is mainly code optimization, there
> is some functional change I think.  (You do mention throttling on
> tmpfs/ramfs, but the way it worked out wasn't straightforward.)

Good suggestion, thanks!

> I'd better not burble on for a third paragraph!

How about this updated patch?

Thanks,
Fengguang
---
Subject: writeback: skip balance_dirty_pages() for in-memory fs
Date: Thu Dec 16 22:22:00 CST 2010

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

It also prevents

[  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050

in the balance_dirty_pages tracepoint, which will call

	dev_name(mapping->backing_dev_info->dev)

but shmem_backing_dev_info.dev is NULL.

Summary notes about the tmpfs/ramfs behavior changes:

As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the per-bdi dirty
page counts and the per-bdi threshold will be 0 for tmpfs/ramfs.
Hence this test will always evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the current logic does not allow the
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once it happens, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
for a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly bring help to any workload because
of its "either no throttling, or get throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.

CC: Hugh Dickins <hughd@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-18 09:14:53.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-21 17:35:44.000000000 +0800
@@ -230,13 +230,8 @@ void task_dirty_inc(struct task_struct *
 static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
-	if (bdi_cap_writeback_dirty(bdi)) {
-		prop_fraction_percpu(&vm_completions, &bdi->completions,
+	prop_fraction_percpu(&vm_completions, &bdi->completions,
 				numerator, denominator);
-	} else {
-		*numerator = 0;
-		*denominator = 1;
-	}
 }
 
 static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -878,6 +873,9 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	current->nr_dirtied += nr_pages_dirtied;
 
 	if (unlikely(!current->nr_dirtied_pause))

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] writeback: skip balance_dirty_pages() for in-memory fs
  2010-12-21  9:39             ` Wu Fengguang
@ 2010-12-30  3:15               ` Hugh Dickins
  -1 siblings, 0 replies; 202+ messages in thread
From: Hugh Dickins @ 2010-12-30  3:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Tue, 21 Dec 2010, Wu Fengguang wrote:
> 
> This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
> 
> It also prevents
> 
> [  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> 
> in the balance_dirty_pages tracepoint, which will call
> 
> 	dev_name(mapping->backing_dev_info->dev)
> 
> but shmem_backing_dev_info.dev is NULL.
> 
> Summary notes about the tmpfs/ramfs behavior changes:
> 
> As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
> balance_dirty_pages() as long as we are over the (dirty+background)/2
> global throttle threshold.  This is because both the dirty pages and
> threshold will be 0 for tmpfs/ramfs. Hence this test will always
> evaluate to TRUE:
> 
>                 dirty_exceeded =
>                         (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
>                         || (nr_reclaimable + nr_writeback >= dirty_thresh);
> 
> For 2.6.37, someone complained that the current logic does not allow the
> users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to
> 
>                 dirty_exceeded =
>                         (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
>                         || (nr_reclaimable + nr_writeback > dirty_thresh);
> 
> So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
> throttled unless the global dirty threshold is exceeded (which is very
> unlikely to happen; once happen, will block many tasks).
> 
> I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
> for a busy writing server, tmpfs write()s may get livelocked! The
> "inadvertent" throttling can hardly bring help to any workload because
> of its "either no throttling, or get throttled to death" property.
> 
> So based on 2.6.37, this patch won't bring more noticeable changes.
> 
> CC: Hugh Dickins <hughd@google.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks a lot for investigating further and writing it all up here.

Acked-by: Hugh Dickins <hughd@google.com>

I notice bdi_cap_writeback_dirty go from bdi_writeout_fraction(), and
bdi_cap_account_dirty appear in balance_dirty_pages_ratelimited_nr():
maybe one day a patch to use just one flag throughout?  Unless you can
dream up a use for the divergence.  (I hate wasting brainpower trying to
decide which of two always-the-sames to use, like page_cache_release()
and put_page(), until there's actual code to distinguish them.)

Hugh
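
For reference, the two helpers Hugh compares are thin wrappers around
bdi->capabilities flag bits. A sketch of how they look in
include/linux/backing-dev.h of this era (from memory, details may differ
slightly):

	#define BDI_CAP_NO_ACCT_DIRTY	0x00000001	/* no dirty page accounting */
	#define BDI_CAP_NO_WRITEBACK	0x00000002	/* never writeback dirty pages */

	static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
	{
		return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
	}

	static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
	{
		return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
	}

Since tmpfs/ramfs set both bits (via BDI_CAP_NO_ACCT_AND_WRITEBACK), the two
checks always agree for them, which is why a single flag would do -- exactly
Hugh's point.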

> ---
>  mm/page-writeback.c |   10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-12-18 09:14:53.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-12-21 17:35:44.000000000 +0800
> @@ -230,13 +230,8 @@ void task_dirty_inc(struct task_struct *
>  static void bdi_writeout_fraction(struct backing_dev_info *bdi,
>  		long *numerator, long *denominator)
>  {
> -	if (bdi_cap_writeback_dirty(bdi)) {
> -		prop_fraction_percpu(&vm_completions, &bdi->completions,
> +	prop_fraction_percpu(&vm_completions, &bdi->completions,
>  				numerator, denominator);
> -	} else {
> -		*numerator = 0;
> -		*denominator = 1;
> -	}
>  }
>  
>  static inline void task_dirties_fraction(struct task_struct *tsk,
> @@ -878,6 +873,9 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> +	if (!bdi_cap_account_dirty(bdi))
> +		return;
> +
>  	current->nr_dirtied += nr_pages_dirtied;
>  
>  	if (unlikely(!current->nr_dirtied_pause))

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2010-12-13 14:46   ` Wu Fengguang
@ 2011-01-12 21:43     ` Jan Kara
  -1 siblings, 0 replies; 202+ messages in thread
From: Jan Kara @ 2011-01-12 21:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Rik van Riel, Peter Zijlstra,
	Christoph Hellwig, Trond Myklebust, Dave Chinner,
	Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
	Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML

  Hi Fengguang,

On Mon 13-12-10 22:46:47, Wu Fengguang wrote:
> I noticed that my NFSROOT test system goes slow responding when there
> is heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
> is near 0 and many tasks in the system are repeatedly stuck in
> balance_dirty_pages().
> 
> There are two generic problems:
> 
> - light dirtiers at one device (more often than not the rootfs) get
>   heavily impacted by heavy dirtiers on another independent device
> 
> - the light dirtied device does heavy throttling because bdi limit=0,
>   and the heavy throttling may in turn withhold its bdi limit in 0 as
>   it cannot dirty fast enough to grow up the bdi's proportional weight.
> 
> Fix it by introducing some "low pass" gate, which is a small (<=32MB)
> value reserved by others and can be safely "stole" from the current
> global dirty margin.  It does not need to be big to help the bdi gain
> its initial weight.
  I'm sorry for the late reply but I didn't get to your patches earlier...

...
> -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> + *
> + * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
> + * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
> + * B always get heavily throttled and bdi B's dirty limit might never be able
> + * to grow up from 0. So we do tricks to reserve some global margin and honour
> + * it to the bdi's that run low.
> + */
> +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> +			      unsigned long dirty,
> +			      unsigned long dirty_pages)
>  {
>  	u64 bdi_dirty;
>  	long numerator, denominator;
>  
>  	/*
> +	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
> +	 */
> +	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
> +
> +	/*
>  	 * Calculate this BDI's share of the dirty ratio.
>  	 */
>  	bdi_writeout_fraction(bdi, &numerator, &denominator);
> @@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
>  	do_div(bdi_dirty, denominator);
>  
>  	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> +
> +	/*
> +	 * If we can dirty N more pages globally, honour N/2 to the bdi that
> +	 * runs low, so as to help it ramp up.
> +	 */
> +	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
> +		     dirty > dirty_pages))
> +		bdi_dirty = (dirty - dirty_pages) / 2;
> +
I wonder how well this works - have you tried it? Because from my naive
understanding, if we have say two drives - sda and sdb - and someone is
banging sda really hard (several processes writing to the disk as fast as
they can), then we are really close to the dirty limit anyway and thus we
won't give much space for sdb to ramp up its writeout fraction...  Didn't
you intend to use 'dirty' without the safety margin subtracted in the above
condition? That would then make more sense to me (i.e. those 32MB are then
used as the ramp-up area).

If I'm right in the above, maybe you could simplify the above condition to:
if (bdi_dirty < margin)
	bdi_dirty = margin;

Effectively it seems rather similar to me and it's immediately obvious how
it behaves. The global limit is enforced anyway, so the logic just differs
in how many dirtiers on the ramping-up bdi are needed to suck out the margin.
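
For reference, a minimal sketch of the two variants being compared here --
the patch's "honour N/2 of the remaining global room" versus the "floor at
the reserved margin" simplification -- reusing the margin computation from
the patch (illustration only, not the actual code):

	/* ~1% of the global limit, capped at 32MB, as in the patch */
	margin = min(dirty / 128, 32768UL >> (PAGE_SHIFT - 10));
	dirty -= margin;

	/* ... bdi_dirty = this bdi's proportional share of 'dirty' ... */

	/* patch: give a ramping-up bdi half of the remaining global room */
	if (dirty > dirty_pages && bdi_dirty < (dirty - dirty_pages) / 2)
		bdi_dirty = (dirty - dirty_pages) / 2;

	/* suggested simplification: never let the bdi limit fall below
	 * the reserved margin */
	if (bdi_dirty < margin)
		bdi_dirty = margin;

The difference being pointed out: when heavy dirtiers on another bdi keep
dirty_pages close to 'dirty', the first form leaves almost no room for the
light bdi, while the second always guarantees it the reserved ~32MB.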

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 02/35] writeback: safety margin for bdi stat error
  2010-12-13 14:46   ` Wu Fengguang
@ 2011-01-12 21:59     ` Jan Kara
  -1 siblings, 0 replies; 202+ messages in thread
From: Jan Kara @ 2011-01-12 21:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Mon 13-12-10 22:46:48, Wu Fengguang wrote:
> In a simple dd test on a 8p system with "mem=256M", I find all light
> dirtier tasks on the root fs are get heavily throttled. That happens
> because the global limit is exceeded. It's unbelievable at first sight,
> because the test fs doing the heavy dd is under its bdi limit.  After
> doing some tracing, it's discovered that
> 
>         bdi_dirty < bdi_dirty_limit() < global_dirty_limit() < nr_dirty
          ^^ bdi_dirty is the number of pages dirtied on BDI? I.e.
bdi_nr_reclaimable + bdi_nr_writeback?

> So the root cause is, the bdi_dirty is well under the global nr_dirty
> due to accounting errors. This can be fixed by using bdi_stat_sum(),
  So which statistic had the big error? I'd just like to understand
this (and how come your patch improves the situation)...

> however that's costly on large NUMA machines. So do a less costly fix
> of lowering the bdi limit, so that the accounting errors won't lead to
> the absurd situation "global limit exceeded but bdi limit not exceeded".
> 
> This provides guarantee when there is only 1 heavily dirtied bdi, and
> works by opportunity for 2+ heavy dirtied bdi's (hopefully they won't
> reach big error _and_ exceed their bdi limit at the same time).
> 
...
> @@ -458,6 +464,14 @@ unsigned long bdi_dirty_limit(struct bac
>  	long numerator, denominator;
>  
>  	/*
> +	 * try to prevent "global limit exceeded but bdi limit not exceeded"
> +	 */
> +	if (likely(dirty > bdi_stat_error(bdi)))
> +		dirty -= bdi_stat_error(bdi);
> +	else
> +		return 0;
> +
  Ugh, so if by any chance global_dirty_limit() <= bdi_stat_error(bdi), you
will limit the number of unreclaimable pages for that bdi to 0? Why?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2011-01-12 21:43     ` Jan Kara
  (?)
@ 2011-01-13  3:44     ` Wu Fengguang
  2011-01-13  3:58         ` Wu Fengguang
  2011-01-13 19:26         ` Peter Zijlstra
  -1 siblings, 2 replies; 202+ messages in thread
From: Wu Fengguang @ 2011-01-13  3:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 6152 bytes --]

Hi Jan,

On Thu, Jan 13, 2011 at 05:43:03AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Mon 13-12-10 22:46:47, Wu Fengguang wrote:
> > I noticed that my NFSROOT test system goes slow responding when there
> > is heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
> > is near 0 and many tasks in the system are repeatedly stuck in
> > balance_dirty_pages().
> > 
> > There are two generic problems:
> > 
> > - light dirtiers at one device (more often than not the rootfs) get
> >   heavily impacted by heavy dirtiers on another independent device
> > 
> > - the light dirtied device does heavy throttling because bdi limit=0,
> >   and the heavy throttling may in turn withhold its bdi limit in 0 as
> >   it cannot dirty fast enough to grow up the bdi's proportional weight.
> > 
> > Fix it by introducing some "low pass" gate, which is a small (<=32MB)
> > value reserved by others and can be safely "stole" from the current
> > global dirty margin.  It does not need to be big to help the bdi gain
> > its initial weight.
>   I'm sorry for a late reply but I didn't get earlier to your patches...

It's fine. Honestly speaking, the patches are still somewhat "experimental"
and will need a major refactor. When testing the 10-disk JBOD setup, I
find that bdi_dirty_limit fluctuates too much. So I'm considering using
global_dirty_limit as the control target.

Attached is the JBOD test result for XFS. Other filesystems share the
same problem more or less.  Here you can find some old graphs:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/

> ...
> > -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> > + *
> > + * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
> > + * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
> > + * B always get heavily throttled and bdi B's dirty limit might never be able
> > + * to grow up from 0. So we do tricks to reserve some global margin and honour
> > + * it to the bdi's that run low.
> > + */
> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			      unsigned long dirty,
> > +			      unsigned long dirty_pages)
> >  {
> >  	u64 bdi_dirty;
> >  	long numerator, denominator;
> >  
> >  	/*
> > +	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
> > +	 */
> > +	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
> > +
> > +	/*
> >  	 * Calculate this BDI's share of the dirty ratio.
> >  	 */
> >  	bdi_writeout_fraction(bdi, &numerator, &denominator);
> > @@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
> >  	do_div(bdi_dirty, denominator);
> >  
> >  	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> > +
> > +	/*
> > +	 * If we can dirty N more pages globally, honour N/2 to the bdi that
> > +	 * runs low, so as to help it ramp up.
> > +	 */
> > +	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
> > +		     dirty > dirty_pages))
> > +		bdi_dirty = (dirty - dirty_pages) / 2;
> > +
> I wonder how well this works - have you tried that? Because from my naive

Yes, I've been running it in the tests. It does show some undesirable
effects in multi-disk tests. For example, it leads to a higher than
necessary bdi_dirty_limit for the slow USB key in the test case of
concurrent writing to 1 UKEY and 1 HDD. See the second graph.
You'll see that it takes a long time for the UKEY's bdi_dirty_limit
to shrink back to normal. The avg_dirty and bdi_dirty also diverge
too much. I'll fix them in the next update, where bdi_dirty_limit
will no longer play as big a role as it does in the current code, and
this patch will also need to be reconsidered and may look much
different then.

> understanding if we have say two drives - sda, sdb. Someone is banging sda
> really hard (several processes writing to the disk as fast as they can), then
> we are really close to dirty limit anyway and thus we won't give much space
> for sdb to ramp up it's writeout fraction...  Didn't you intend to use
> 'dirty' without the safety margin subtracted in the above condition? That
> would then make more sense to me (i.e. those 32MB are then used as the
> ramp-up area).
> 
> If I'm right in the above, maybe you could simplify the above condition to:
> if (bdi_dirty < margin)
> 	bdi_dirty = margin;
> 
> Effectively it seems rather similar to me and it's immediately obvious how
> it behales. Global limit is enforced anyway so the logic just differs in
> the number of dirtiers on ramping-up bdi you need to suck out the margin.

sigh.. I've been hassled a lot by the possible disharmonies between
the bdi/global dirty limits.

One example is the graph below, where the bdi dirty pages are
constantly exceeding the bdi dirty limit. The root cause is that
"(dirty + background) / 2" may be close to, or even exceed,
bdi_dirty_limit.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/256M/ext3-2dd-1M-8p-191M-2.6.37-rc5+-2010-12-09-13-42/dirty-pages-200.png
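
As a purely hypothetical illustration (numbers invented, not taken from the
graph above): with dirty_thresh = 40MB and background_thresh = 20MB, the
balance goal (dirty + background) / 2 = 30MB; if the bdi's proportional
limit works out to only 25MB, dirtiers are only made to pause once the
global dirty pages approach ~30MB, so the bdi's dirty pages can sit
permanently above its own 25MB limit.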

Another problem is the btrfs JBOD case, where the global limit can be
exceeded at times. The root cause is that some bdi limits are dropping
while others are increasing. If a bdi dirty limit drops too fast -- so
fast that it falls below that bdi's dirty pages -- then even if the sum
of all bdi dirty limits stays below the global limit, the sum of all
bdi dirty pages can still exceed the global limit.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/btrfs-fio-jbod-sync-128k-24p-15977M-2.6.37-rc8-dt5+-2010-12-31-10-06/global_dirty_state.png
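
To make that failure mode concrete with invented numbers: suppose the
global limit is 800MB and bdi A's limit drops from 600MB to 300MB while
A still holds 550MB of dirty pages; bdi B's limit meanwhile rises to
450MB while B holds only 300MB. The sum of the limits (750MB) is below
the global limit, yet the sum of the dirty pages (850MB) exceeds it,
because A's pages lag behind its falling limit.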

The "enforced" global limit will jump into action here. However it
turns out to be very undesirable behavior. In the tests, I run some
tasks to collect vmstat information, and whenever the global limit is
exceeded, I see disrupted samples in the vmstat graph. That's because
exceeding the global limit blocks _all_ dirtiers in the system,
whether they are light dirtiers or writers to an independent fast
storage device.

I hope the move to the global dirty pages/limit as the main control
feedback, with bdi_dirty_limit as the secondary control feedback, will
help address the problem nicely.

Thanks,
Fengguang

[-- Attachment #2: xfs-jbod-balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 344313 bytes --]

[-- Attachment #3: ukey+hdd-balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 112702 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2011-01-13  3:44     ` Wu Fengguang
@ 2011-01-13  3:58         ` Wu Fengguang
  2011-01-13 19:26         ` Peter Zijlstra
  1 sibling, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2011-01-13  3:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
	linux-fsdevel, LKML

> sigh.. I've been hassled a lot by the possible disharmonies between
> the bdi/global dirty limits.
> 
> One example is the below graph, where the bdi dirty pages are
> constantly exceeding the bdi dirty limit. The root cause is,
> "(dirty + background) / 2" may be close to or even exceed
> bdi_dirty_limit. 

When it is exceeded, the task will not get throttled at all at some
times, and will get hard throttled at others.

> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/256M/ext3-2dd-1M-8p-191M-2.6.37-rc5+-2010-12-09-13-42/dirty-pages-200.png

This graph shows it more clearly. However I'm no longer sure these are
the exact graphs caused by "(dirty + background) / 2 > bdi_dirty_limit",
which evaluates to TRUE after I apply "[PATCH 02/35] writeback: safety
margin for bdi stat error", since that patch lowered bdi_dirty_limit by
1-2MB in that test case.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/256M/btrfs-1dd-1M-8p-191M-2.6.37-rc5+-2010-12-09-14-35/dirty-pages-200.png

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 02/35] writeback: safety margin for bdi stat error
  2011-01-12 21:59     ` Jan Kara
@ 2011-01-13  4:14       ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2011-01-13  4:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Thu, Jan 13, 2011 at 05:59:49AM +0800, Jan Kara wrote:
> On Mon 13-12-10 22:46:48, Wu Fengguang wrote:
> > In a simple dd test on a 8p system with "mem=256M", I find all light
> > dirtier tasks on the root fs are get heavily throttled. That happens
> > because the global limit is exceeded. It's unbelievable at first sight,
> > because the test fs doing the heavy dd is under its bdi limit.  After
> > doing some tracing, it's discovered that
> > 
> >         bdi_dirty < bdi_dirty_limit() < global_dirty_limit() < nr_dirty
>           ^^ bdi_dirty is the number of pages dirtied on BDI? I.e.
> bdi_nr_reclaimable + bdi_nr_writeback?

Yes.

> > So the root cause is, the bdi_dirty is well under the global nr_dirty
> > due to accounting errors. This can be fixed by using bdi_stat_sum(),
>   So which statistic had the big error? I'd just like to understand
> this (and how come your patch improves the situation)...

bdi_stat_error() = nr_cpu_ids * BDI_STAT_BATCH
                 = 8 * (8*(1+ilog2(8)))
                 = 8 * 8 * 4
                 = 256 pages
                 = 1MB
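
For reference, the bound comes from the per-CPU counter batching; the
relevant definitions in include/linux/backing-dev.h of this era look
roughly like this (sketch from memory, details may differ):

	/* per-cpu counter batch size grows with the number of CPUs */
	#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

	static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
	{
	#ifdef CONFIG_SMP
		return nr_cpu_ids * BDI_STAT_BATCH;
	#else
		return 1;
	#endif
	}

So each bdi counter read via the cheap bdi_stat() can be off by up to
nr_cpu_ids * BDI_STAT_BATCH pages from the exact bdi_stat_sum() value.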

> > however that's costly on large NUMA machines. So do a less costly fix
> > of lowering the bdi limit, so that the accounting errors won't lead to
> > the absurd situation "global limit exceeded but bdi limit not exceeded".
> > 
> > This provides guarantee when there is only 1 heavily dirtied bdi, and
> > works by opportunity for 2+ heavy dirtied bdi's (hopefully they won't
> > reach big error _and_ exceed their bdi limit at the same time).
> > 
> ...
> > @@ -458,6 +464,14 @@ unsigned long bdi_dirty_limit(struct bac
> >  	long numerator, denominator;
> >  
> >  	/*
> > +	 * try to prevent "global limit exceeded but bdi limit not exceeded"
> > +	 */
> > +	if (likely(dirty > bdi_stat_error(bdi)))
> > +		dirty -= bdi_stat_error(bdi);
> > +	else
> > +		return 0;
> > +
>   Ugh, so if by any chance global_dirty_limit() <= bdi_stat_error(bdi), you
> will limit number of unreclaimable pages for that bdi 0? Why?

Good catch! Yeah, it may lead to regressions and should be avoided.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 02/35] writeback: safety margin for bdi stat error
  2011-01-13  4:14       ` Wu Fengguang
@ 2011-01-13 10:38         ` Jan Kara
  -1 siblings, 0 replies; 202+ messages in thread
From: Jan Kara @ 2011-01-13 10:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Thu 13-01-11 12:14:40, Wu Fengguang wrote:
> On Thu, Jan 13, 2011 at 05:59:49AM +0800, Jan Kara wrote:
> > > So the root cause is, the bdi_dirty is well under the global nr_dirty
> > > due to accounting errors. This can be fixed by using bdi_stat_sum(),
> >   So which statistic had the big error? I'd just like to understand
> > this (and how come your patch improves the situation)...
> 
> bdi_stat_error() = nr_cpu_ids * BDI_STAT_BATCH
>                  = 8 * (8*(1+ilog2(8)))
>                  = 8 * 8 * 4
>                  = 256 pages
>                  = 1MB
  Yes, my question was more aimed at which statistic the error happens in,
such that it causes problems for you. Thinking about it now, I suppose you
observe bdi_nr_writeback + bdi_nr_reclaimable < bdi_thresh while in fact
the number of pages is higher than bdi_thresh because of accounting errors.
Thus we are able to reach the global dirty limit and the tasks get
throttled heavily. Am I right?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 02/35] writeback: safety margin for bdi stat error
  2011-01-13 10:38         ` Jan Kara
@ 2011-01-13 10:41           ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2011-01-13 10:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
	Minchan Kim, linux-mm, linux-fsdevel, LKML

On Thu, Jan 13, 2011 at 06:38:34PM +0800, Jan Kara wrote:
> On Thu 13-01-11 12:14:40, Wu Fengguang wrote:
> > On Thu, Jan 13, 2011 at 05:59:49AM +0800, Jan Kara wrote:
> > > > So the root cause is, the bdi_dirty is well under the global nr_dirty
> > > > due to accounting errors. This can be fixed by using bdi_stat_sum(),
> > >   So which statistic had the big error? I'd just like to understand
> > > this (and how come your patch improves the situation)...
> > 
> > bdi_stat_error() = nr_cpu_ids * BDI_STAT_BATCH
> >                  = 8 * (8*(1+ilog2(8)))
> >                  = 8 * 8 * 4
> >                  = 256 pages
> >                  = 1MB
>   Yes, my question was more aiming at on which statistics the error happens
> so that it causes problems for you. Thinking about it now I suppose you
> observe that bdi_nr_writeback + bdi_nr_reclaimable < bdi_thresh but in fact
> the number of pages is higher than bdi_thresh because of accounting errors.
> And thus we are able to reach global dirty limit and the tasks get
> throttled heavily. Am I right?

Yes, exactly.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2011-01-13  3:44     ` Wu Fengguang
  2011-01-13  3:58         ` Wu Fengguang
@ 2011-01-13 19:26         ` Peter Zijlstra
  1 sibling, 0 replies; 202+ messages in thread
From: Peter Zijlstra @ 2011-01-13 19:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Rik van Riel, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
	linux-fsdevel, LKML

On Thu, 2011-01-13 at 11:44 +0800, Wu Fengguang wrote:
> When testing 10-disk JBOD setup, I
> find that bdi_dirty_limit fluctuations too much. So I'm considering
> use global_dirty_limit as control target. 

Is this because the bandwidth is equal to or larger than the dirty period?



^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
  2011-01-13 19:26         ` Peter Zijlstra
  (?)
  (?)
@ 2011-01-14  3:21         ` Wu Fengguang
  -1 siblings, 0 replies; 202+ messages in thread
From: Wu Fengguang @ 2011-01-14  3:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, Andrew Morton, Rik van Riel, Christoph Hellwig,
	Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
	Mel Gorman, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: Type: text/plain, Size: 1767 bytes --]

On Fri, Jan 14, 2011 at 03:26:10AM +0800, Peter Zijlstra wrote:
> On Thu, 2011-01-13 at 11:44 +0800, Wu Fengguang wrote:
> > When testing 10-disk JBOD setup, I
> > find that bdi_dirty_limit fluctuations too much. So I'm considering
> > use global_dirty_limit as control target. 
> 
> Is this because the bandwidth is equal or larger than the dirty period?

The patchset will call ->writepages(N) with
N=rounddown_pow_of_two(bdi->write_bandwidth). XFS will then typically
complete IO (end_io) in batches of the same size. In practice I see
XFS's xfs_end_io() work get queued and executed ~2 times per second,
normally clearing 32MB worth of PG_writeback each time. I guess this
is one major source of the fluctuation.

The attached XFS graphs confirm this: the "written" and "writeback"
curves jump in 32MB steps.

As for the dirty period,

        calc_period_shift()
        = 2 + ilog2(dirty_total - 1)
        = 2 + ilog2(380000)             # a 8GB test box, 20% dirty_ratio
        = 19

So period = (1 << 18) = 256k pages = 1GB. It's much larger than 32MB.
(Please correct me if wrong).
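
For reference, the shift computed above comes from something like this
(2.6.37-era mm/page-writeback.c, sketched from memory, details may differ):

	static int calc_period_shift(void)
	{
		unsigned long dirty_total;

		if (vm_dirty_bytes)
			dirty_total = vm_dirty_bytes / PAGE_SIZE;
		else
			dirty_total = (vm_dirty_ratio *
				       determine_dirtyable_memory()) / 100;
		return 2 + ilog2(dirty_total - 1);
	}

so the proportion period scales with the total number of dirtyable pages
rather than with the per-bdi write bandwidth.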

The problem is not limited to XFS. ext2/ext3/ext4 are also fluctuating
in a range up to bdi->write_bandwidth.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/ext2-fio-jbod-sync-128k-24p-15977M-2.6.37-rc8-dt5+-2010-12-31-19-36/balance_dirty_pages-pages.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/ext4_wb-fio-jbod-sync-128k-24p-15977M-2.6.37-rc8-dt5+-2010-12-31-12-24/balance_dirty_pages-pages.png

I noticed (ext2/ext3 graphs attached) that they clear PG_writeback
in much smaller batches, at least in the 1-disk case. However, the
number of writeback pages can still drop low 1-2 times every 10
seconds.

Thanks,
Fengguang

[-- Attachment #2: xfs-1dd-1M-1p-2970M-global_dirty_state-500.png --]
[-- Type: image/png, Size: 55470 bytes --]

[-- Attachment #3: xfs-2dd-1M-1p-2970M-global_dirtied_written-500.png --]
[-- Type: image/png, Size: 49963 bytes --]

[-- Attachment #4: ext3-1dd-1M-1p-2970M-global_dirty_state-500.png --]
[-- Type: image/png, Size: 114311 bytes --]

[-- Attachment #5: ext2-1dd-1M-1p-2970M-global_dirtied_written-500.png --]
[-- Type: image/png, Size: 53505 bytes --]

^ permalink raw reply	[flat|nested] 202+ messages in thread

end of thread, other threads:[~2011-01-14  3:21 UTC | newest]

Thread overview: 202+ messages
2010-12-13 14:46 [PATCH 00/35] IO-less dirty throttling v4 Wu Fengguang
2010-12-13 14:46 ` Wu Fengguang
2010-12-13 14:46 ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2011-01-12 21:43   ` Jan Kara
2011-01-12 21:43     ` Jan Kara
2011-01-13  3:44     ` Wu Fengguang
2011-01-13  3:58       ` Wu Fengguang
2011-01-13  3:58         ` Wu Fengguang
2011-01-13 19:26       ` Peter Zijlstra
2011-01-13 19:26         ` Peter Zijlstra
2011-01-13 19:26         ` Peter Zijlstra
2011-01-14  3:21         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 02/35] writeback: safety margin for bdi stat error Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2011-01-12 21:59   ` Jan Kara
2011-01-12 21:59     ` Jan Kara
2011-01-13  4:14     ` Wu Fengguang
2011-01-13  4:14       ` Wu Fengguang
2011-01-13 10:38       ` Jan Kara
2011-01-13 10:38         ` Jan Kara
2011-01-13 10:41         ` Wu Fengguang
2011-01-13 10:41           ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 03/35] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-14 13:37   ` Richard Kennedy
2010-12-14 13:37     ` Richard Kennedy
2010-12-14 13:59     ` Wu Fengguang
2010-12-14 13:59       ` Wu Fengguang
2010-12-14 14:33       ` Wu Fengguang
2010-12-14 14:33         ` Wu Fengguang
2010-12-14 14:39         ` Wu Fengguang
2010-12-14 14:39           ` Wu Fengguang
2010-12-14 14:50           ` Peter Zijlstra
2010-12-14 14:50             ` Peter Zijlstra
2010-12-14 14:50             ` Peter Zijlstra
2010-12-14 15:15             ` Wu Fengguang
2010-12-14 15:26               ` Wu Fengguang
2010-12-14 14:56           ` Wu Fengguang
2010-12-14 14:56             ` Wu Fengguang
2010-12-15 18:48       ` Richard Kennedy
2010-12-15 18:48         ` Richard Kennedy
2010-12-15 18:48         ` Richard Kennedy
2010-12-17 13:07         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 05/35] writeback: IO-less balance_dirty_pages() Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 06/35] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 07/35] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 08/35] writeback: user space think time compensation Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 09/35] writeback: account per-bdi accumulated written pages Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 10/35] writeback: bdi write bandwidth estimation Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 11/35] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-14  1:21   ` Yan, Zheng
2010-12-14  1:21     ` Yan, Zheng
2010-12-14  7:00     ` Wu Fengguang
2010-12-14  7:00       ` Wu Fengguang
2010-12-14  7:00       ` Wu Fengguang
2010-12-14 13:00       ` Wu Fengguang
2010-12-14 13:00         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 13/35] writeback: bdi base throttle bandwidth Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:46   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 14/35] writeback: smoothed bdi dirty pages Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 15/35] writeback: adapt max balance pause time to memory size Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 18:23   ` Valdis.Kletnieks
2010-12-14  6:51     ` Wu Fengguang
2010-12-14  6:51       ` Wu Fengguang
2010-12-14 18:42       ` Valdis.Kletnieks
2010-12-14 18:55         ` Peter Zijlstra
2010-12-14 18:55           ` Peter Zijlstra
2010-12-14 18:55           ` Peter Zijlstra
2010-12-14 20:13           ` Valdis.Kletnieks
2010-12-14 20:24             ` Peter Zijlstra
2010-12-14 20:24               ` Peter Zijlstra
2010-12-14 20:24               ` Peter Zijlstra
2010-12-14 20:37               ` Valdis.Kletnieks
2010-12-13 14:47 ` [PATCH 17/35] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-16  5:17   ` Wu Fengguang
2010-12-16  5:17     ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 18/35] writeback: start background writeback earlier Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-16  5:37   ` Wu Fengguang
2010-12-16  5:37     ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 19/35] writeback: make nr_to_write a per-file limit Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 20/35] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 21/35] writeback: trace balance_dirty_pages() Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 22/35] writeback: trace global dirty page states Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-17  2:19   ` Wu Fengguang
2010-12-17  2:19     ` Wu Fengguang
2010-12-17  3:11     ` Wu Fengguang
2010-12-17  3:11       ` Wu Fengguang
2010-12-17  6:52     ` Hugh Dickins
2010-12-17  6:52       ` Hugh Dickins
2010-12-17  9:31       ` Wu Fengguang
2010-12-17  9:31         ` Wu Fengguang
2010-12-17 11:21       ` [PATCH] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2010-12-17 11:21         ` Wu Fengguang
2010-12-17 14:21         ` Rik van Riel
2010-12-17 14:21           ` Rik van Riel
2010-12-17 15:34         ` Minchan Kim
2010-12-17 15:34           ` Minchan Kim
2010-12-17 15:42           ` Minchan Kim
2010-12-17 15:42             ` Minchan Kim
2010-12-21  5:59         ` Hugh Dickins
2010-12-21  5:59           ` Hugh Dickins
2010-12-21  9:39           ` Wu Fengguang
2010-12-21  9:39             ` Wu Fengguang
2010-12-30  3:15             ` Hugh Dickins
2010-12-30  3:15               ` Hugh Dickins
2010-12-13 14:47 ` [PATCH 23/35] writeback: trace writeback_single_inode() Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 24/35] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 25/35] btrfs: lower the dirty balacing rate limit Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 26/35] btrfs: wait on too many nr_async_bios Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 27/35] nfs: livelock prevention is now done in VFS Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 28/35] nfs: writeback pages wait queue Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 29/35] nfs: in-commit pages accounting and " Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 21:15   ` Trond Myklebust
2010-12-13 21:15     ` Trond Myklebust
2010-12-14 15:40     ` Wu Fengguang
2010-12-14 15:40       ` Wu Fengguang
2010-12-14 15:57       ` Trond Myklebust
2010-12-14 15:57         ` Trond Myklebust
2010-12-15 15:07         ` Wu Fengguang
2010-12-15 15:07           ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 30/35] nfs: heuristics to avoid commit Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 20:53   ` Trond Myklebust
2010-12-13 20:53     ` Trond Myklebust
2010-12-14  8:20     ` Wu Fengguang
2010-12-14  8:20       ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 31/35] nfs: dont change wbc->nr_to_write in write_inode() Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 21:01   ` Trond Myklebust
2010-12-13 21:01     ` Trond Myklebust
2010-12-14 15:53     ` Wu Fengguang
2010-12-14 15:53       ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 32/35] nfs: limit the range of commits Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 33/35] nfs: adapt congestion threshold to dirty threshold Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 34/35] nfs: trace nfs_commit_unstable_pages() Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 35/35] nfs: trace nfs_commit_release() Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
2010-12-13 14:47   ` Wu Fengguang
     [not found] ` <AANLkTinFeu7LMaDFgUcP3r2oqVHE5bei3T5JTPGBNvS9@mail.gmail.com>
2010-12-14  4:59   ` [PATCH 00/35] IO-less dirty throttling v4 Wu Fengguang
2010-12-14  4:59     ` Wu Fengguang
