* [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
@ 2010-09-12 15:49 ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Andrew Morton, Theodore Ts'o, Dave Chinner, Jan Kara,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li Shaohua, Wu Fengguang

The basic idea is to introduce a small region under the bdi dirty threshold.
The task will be throttled gently when stepping into the bottom of the
region, and throttled more and more aggressively as the bdi dirty+writeback
pages get closer to the top of the region. At some point the application
will be throttled at just the right bandwidth that balances with the device
write bandwidth. (The 2nd patch has more details.)

The first two patch groups introduce two building blocks..

    IO-less balance_dirty_pages()
	[PATCH 02/17] writeback: IO-less balance_dirty_pages()
	[PATCH 03/17] writeback: per-task rate limit to balance_dirty_pages()
	[PATCH 04/17] writeback: quit throttling when bdi dirty/writeback pages go down            
	[PATCH 05/17] writeback: quit throttling when signal pending
    (trace event)
	[PATCH 06/17] writeback: move task dirty fraction to balance_dirty_pages()
	[PATCH 07/17] writeback: add trace event for balance_dirty_pages()

    bandwidth estimation
	[PATCH 08/17] writeback: account per-bdi accumulated written pages
	[PATCH 09/17] writeback: bdi write bandwidth estimation
	[PATCH 10/17] writeback: show bdi write bandwidth in debugfs

..for use by the next two features:

    larger nr_to_write (hence IO size)
	[PATCH 11/17] writeback: make nr_to_write a per-file limit
	[PATCH 12/17] writeback: scale IO chunk size up to device bandwidth

    dynamic dirty pages limit
	[PATCH 14/17] vmscan: add scan_control.priority
	[PATCH 15/17] mm: lower soft dirty limits on memory pressure
	[PATCH 16/17] mm: create /vm/dirty_pressure in debugfs

The following two patches can in fact be merged independently.

    change of rules
	[PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
	[PATCH 13/17] writeback: reduce per-bdi dirty threshold ramp up time

And this cleanup reflects a late thought; it would be better moved to the
beginning of the patch series.

    cleanup
	[PATCH 17/17] writeback: consolidate balance_dirty_pages() variable names


 fs/fs-writeback.c                |   49 ++++-
 include/linux/backing-dev.h      |    2
 include/linux/sched.h            |    7
 include/linux/writeback.h        |   17 +
 include/trace/events/writeback.h |   47 +++++
 mm/backing-dev.c                 |   29 +--
 mm/page-writeback.c              |  248 ++++++++++++++++-------------
 mm/vmscan.c                      |   22 ++
 mm/vmstat.c                      |   29 +++
 9 files changed, 311 insertions(+), 139 deletions(-)

Thanks,
Fengguang



* [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Peter Zijlstra, Wu Fengguang, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-remove-dirty_ratio-low-bound.patch --]
[-- Type: text/plain, Size: 2900 bytes --]

The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not the behavior users expect, and it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.

Let's remove the arbitrary internal bound. It may impact some very unusual
user space applications. However, we are going to dynamically size the
dirty limits anyway, which may well break such applications, too.

At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case. This allows applications to proceed when
dirty+writeback pages have all been cleaned.

And ">" fits the name "exceeded" better than ">=" does. Neil
thinks it is an aesthetic improvement as well as a functional one :)
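
For illustration, here is a small userspace sketch (assumed page count, not
kernel code or part of this patch) of how the old internal floor inflated
the dirty limit for small dirty_ratio values:

	/*
	 * Userspace sketch (assumed numbers): dirty limit in pages before
	 * and after dropping the internal 5% floor on dirty_ratio.
	 */
	#include <stdio.h>

	static unsigned long old_limit(int vm_dirty_ratio, unsigned long available_memory)
	{
		int ratio = vm_dirty_ratio < 5 ? 5 : vm_dirty_ratio;	/* old silent floor */

		return ratio * available_memory / 100;
	}

	static unsigned long new_limit(int vm_dirty_ratio, unsigned long available_memory)
	{
		return vm_dirty_ratio * available_memory / 100;		/* plain ratio */
	}

	int main(void)
	{
		unsigned long pages = 1 << 20;	/* assume ~4GB of dirtyable memory in 4KB pages */

		printf("dirty_ratio=1:  old=%lu pages, new=%lu pages\n",
		       old_limit(1, pages), new_limit(1, pages));
		printf("dirty_ratio=20: old=%lu pages, new=%lu pages\n",
		       old_limit(20, pages), new_limit(20, pages));
		return 0;
	}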

CC: Jan Kara <jack@suse.cz>
Proposed-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c   |    2 +-
 mm/page-writeback.c |   16 +++++-----------
 2 files changed, 6 insertions(+), 12 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-29 08:10:30.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-29 08:12:08.000000000 +0800
@@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else {
-		int dirty_ratio;
-
-		dirty_ratio = vm_dirty_ratio;
-		if (dirty_ratio < 5)
-			dirty_ratio = 5;
-		dirty = (dirty_ratio * available_memory) / 100;
-	}
+	else
+		dirty = (vm_dirty_ratio * available_memory) / 100;
 
 	if (dirty_background_bytes)
 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
@@ -510,7 +504,7 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <
+		if (nr_reclaimable + nr_writeback <=
 				(background_thresh + dirty_thresh) / 2)
 			break;
 
@@ -542,8 +536,8 @@ static void balance_dirty_pages(struct a
 		 * the last resort safeguard.
 		 */
 		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
-			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
+			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
+			|| (nr_reclaimable + nr_writeback > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;
--- linux-next.orig/fs/fs-writeback.c	2010-08-29 08:12:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-29 08:12:53.000000000 +0800
@@ -574,7 +574,7 @@ static inline bool over_bground_thresh(v
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
 	return (global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
 }
 
 /*




* [PATCH 02/17] writeback: IO-less balance_dirty_pages()
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Chris Mason, Dave Chinner, Jan Kara, Peter Zijlstra,
	Jens Axboe, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Christoph Hellwig,
	Li Shaohua

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 20832 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it sleep for some time
to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher
thread to do the background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALE
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, we end up with N IO submitters from at least N different
  inodes at the same time, issuing N different sets of IO with
  potentially zero locality to each other. This results in much lower
  elevator sort/merge efficiency, and hence the disk seeks all over the
  place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- small nr_to_write for fast arrays

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because it could lead to user-perceivable stalls of more than 1 second.
  This limits the current balance_dirty_pages() to small, inefficient IOs.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme that lets
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. This was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitter.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are not likely to be optimized away, and neither are the
  resulting large (and tiny) stall times in IO-completion-based throttling.

So here is a pause-time oriented approach, which tries to control

- the pause time in each balance_dirty_pages() invocation
- the number of pages dirtied before calling balance_dirty_pages()

for smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause times (less than 10ms, which burns CPU power)
- avoid too large pause times (more than 100ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, here is the behavior when doing a simple cp on ext4 with mem=4G HZ=250.

before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

after patch, the pause time remains stable around 32ms

cp-2687  [007]  1452.189182: balance_dirty_pages: bdi=8:0 weight=56% thresh=123892 gap=7700 dirtied=128 pause=8 bw=64494573
cp-2687  [007]  1452.198232: balance_dirty_pages: bdi=8:0 weight=56% thresh=123900 gap=7708 dirtied=128 pause=8 bw=64562234
cp-2687  [006]  1452.205170: balance_dirty_pages: bdi=8:0 weight=56% thresh=123907 gap=7715 dirtied=128 pause=8 bw=64613176
cp-2687  [006]  1452.213115: balance_dirty_pages: bdi=8:0 weight=56% thresh=123907 gap=7715 dirtied=128 pause=8 bw=64613829
cp-2687  [006]  1452.222154: balance_dirty_pages: bdi=8:0 weight=56% thresh=123908 gap=7716 dirtied=128 pause=8 bw=64622856
cp-2687  [002]  1452.229099: balance_dirty_pages: bdi=8:0 weight=56% thresh=123908 gap=7716 dirtied=128 pause=8 bw=64623508
cp-2687  [002]  1452.237012: balance_dirty_pages: bdi=8:0 weight=56% thresh=123915 gap=7723 dirtied=128 pause=8 bw=64682786
cp-2687  [002]  1452.246157: balance_dirty_pages: bdi=8:0 weight=56% thresh=123915 gap=7723 dirtied=128 pause=8 bw=64683437
cp-2687  [006]  1452.253043: balance_dirty_pages: bdi=8:0 weight=56% thresh=123922 gap=7730 dirtied=128 pause=8 bw=64734358
cp-2687  [006]  1452.261899: balance_dirty_pages: bdi=8:0 weight=57% thresh=123917 gap=7725 dirtied=128 pause=8 bw=64765323
cp-2687  [006]  1452.268939: balance_dirty_pages: bdi=8:0 weight=57% thresh=123924 gap=7732 dirtied=128 pause=8 bw=64816229
cp-2687  [002]  1452.276932: balance_dirty_pages: bdi=8:0 weight=57% thresh=123930 gap=7738 dirtied=128 pause=8 bw=64867113
cp-2687  [002]  1452.285889: balance_dirty_pages: bdi=8:0 weight=57% thresh=123931 gap=7739 dirtied=128 pause=8 bw=64876082


CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_thresh according to the dirty
"weight" of the current task, which is the percent of pages recently
dirtied by the task. If 100% of the pages were recently dirtied by the
task, it will lower bdi_thresh by 1/8. If only 1% of the pages were
dirtied by the task, it will return an almost unmodified bdi_thresh. In
this way, a heavy dirtier will get blocked at (bdi_thresh-bdi_thresh/8)
while a light dirtier is allowed to progress (the latter won't be blocked
because R << B in fig.1).
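
For concreteness, here is a userspace sketch (assumed numbers, not kernel
code) of that scaling; the real task_dirty_limit() gets the task's dirty
fraction from task_dirties_fraction():

	/*
	 * Userspace sketch (assumed numbers): bdi_thresh scaled down by the
	 * task's dirty weight.  Before this patch the maximum reduction is
	 * 1/8; the patch below changes it to 1/16.
	 */
	#include <stdio.h>

	static unsigned long task_dirty_limit(unsigned long bdi_thresh,
					      long numerator, long denominator)
	{
		unsigned long inv = bdi_thresh / 8;	/* maximum reduction: 1/8 */

		inv = inv * numerator / denominator;
		return bdi_thresh - inv;
	}

	int main(void)
	{
		unsigned long thresh = 100000;	/* assumed bdi_thresh, in pages */

		/* heavy dirtier: ~100% of recently dirtied pages */
		printf("heavy dirtier: %lu pages\n", task_dirty_limit(thresh, 100, 100));
		/* light dirtier: ~1% of recently dirtied pages */
		printf("light dirtier: %lu pages\n", task_dirty_limit(thresh, 1, 100));
		return 0;
	}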

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit
  L: bdi_dirty_limit - bdi_dirty_limit/8

  R: bdi_reclaimable + bdi_writeback

  A: bdi_thresh for a heavy dirtier ~= R ~= L
  B: bdi_thresh for a light dirtier ~= T

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The bdi_thresh for A and B will be approaching
the center of region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (L, A) is introduced. When R enters
this region, the task may be throttled for t seconds on every N pages it
dirties. Let's call (N/t) the "throttle bandwidth". It is computed by the
following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L)
where
        L = A - A/16
        A = T - T/16

So when there is only one heavy dirtier (fig.3),

        R ~= L
        throttle_bandwidth ~= bdi_bandwidth

It's a stable balance:
- when R > L, then throttle_bandwidth < bdi_bandwidth, so R will decrease to L
- when R < L, then throttle_bandwidth > bdi_bandwidth, so R will increase to L
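
To make the numbers concrete, here is a minimal userspace sketch (assumed
bandwidth and limits, not kernel code) of the formula above:

	/*
	 * Userspace sketch (assumed numbers): the throttle bandwidth as R
	 * moves through the soft region (L, A).  At R = L the task runs at
	 * the full device bandwidth; near R = A it is throttled to ~0.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long A = 100000;		/* task_dirty_limit, in pages (assumed) */
		unsigned long L = A - A / 16;		/* bottom of the soft throttle region */
		unsigned long bdi_bw = 25600;		/* assumed ~100MB/s in 4KB pages/s */
		unsigned long R;

		for (R = L; R <= A; R += (A - L) / 4) {
			unsigned long tbw = bdi_bw * (A - R) / (A - L);

			printf("R=%6lu  throttle_bandwidth=%5lu pages/s\n", R, tbw);
		}
		return 0;
	}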

Fig.3 after patch, one heavy dirtier

                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              L |           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit = bdi_dirty_limit - bdi_dirty_limit/16
  L: task_dirty_limit - task_dirty_limit/16

  R: bdi_reclaimable + bdi_writeback ~= L

When a new cp task comes along, its weight will grow from 0 to 50%.
While the weight is still small, it's considered a light dirtier and is
allowed to dirty pages much faster than the bdi write bandwidth. In fact,
initially it won't be throttled at all while R < Lb, where Lb=B-B/16 and B~=T.

Fig.4 after patch, an old cp + a newly started cp

                     (throttle bandwidth) =>    *
                                                | *
                                                |   *
                                                |     *
                                                |       *
                                                |         *
                                                |           *
                                                |             *
                      throttle bandwidth  =>    o               *
                                                | o               *
                                                |   o               *
                                                |     o               *
                                                |       o               *
                                                |         o               *
                                                |           o               *
------------------------------------------------+-------------o---------------*|
                                                R             A               BT

So R will quickly grow large (fig.5). As the two heavy dirtiers' weights
converge to 50%, the points A and B will move towards each other and
eventually coincide in fig.5. R will stabilize around A-A/32 where
A=B=T-T/16, and throttle_bandwidth will stabilize around bdi_bandwidth/2.
There won't be big oscillations between A and B, because as long as A
coincides with B, their throttle_bandwidth and dirtied pages will be
equal: A's weight will stop decreasing and B's weight will stop growing,
so the two points won't keep moving and cross each other. So it's a
pretty stable control system. The only problem is that it converges a
bit slowly (except on really fast storage arrays).

Fig.5 after patch, the two heavy dirtiers converging to the same bandwidth

                                                         |
                                                         |
                                 throttle bandwidth  =>  *
                                                         | *
                                 throttle bandwidth  =>  o   *
                                                         | o   *
                                                         |   o   *
                                                         |     o   *
                                                         |       o   *
                                                         |         o   *
---------------------------------------------------------+-----------o---*-----|
                                                         R           A   B     T

Note that the application "think time" is ignored for simplicity in the
above discussions. With non-zero user space think time, the balance
point will drift slightly, which is otherwise not a big deal.

PSEUDO CODE
===========

balance_dirty_pages():

	if (dirty_soft_thresh exceeded &&
	      bdi_soft_thresh exceeded)
		sleep (pages_dirtied / throttle_bandwidth)

	while (bdi_thresh exceeded) {
		sleep 200ms
		break if (bdi dirty/writeback pages) _dropped_ more than
			8 * (pages_dirtied by this task)
	}

	while (dirty_thresh exceeded)
		sleep 200ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when bdi_thresh is exceeded, the task will be throttled until the bdi
  dirty/writeback pages have dropped by a reasonably large amount

- when dirty_thresh is exceeded, the task will be throttled for an
  arbitrarily long time

BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, it's in fact slightly slower.

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. It should
mainly benefit file servers with heavy concurrent writers on fast
storage arrays.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    9 +++
 mm/page-writeback.c       |   95 +++++++++++-------------------------
 2 files changed, 39 insertions(+), 65 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-09-09 15:43:29.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-12 12:51:20.000000000 +0800
@@ -14,6 +14,15 @@ extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
+ * The 1/16 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define DIRTY_SOFT_THROTTLE_RATIO	16
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-09-09 15:43:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-12 13:18:08.000000000 +0800
@@ -42,20 +42,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -279,7 +265,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / DIRTY_SOFT_THROTTLE_RATIO;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -473,26 +459,26 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long pause;
+	unsigned long gap;
+	unsigned long bw;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -529,6 +515,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback <=
+			bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO)
+			goto check_exceeded;
+
+		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
+		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
+
+		bw = (100 << 20) * gap /
+				(bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, HZ/5);
+
+		__set_current_state(TASK_INTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
+check_exceeded:
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -544,35 +547,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -581,16 +555,7 @@ static void balance_dirty_pages(struct a
 	if (writeback_in_progress(bdi))
 		return;
 
-	/*
-	 * In laptop mode, we wait until hitting the higher threshold before
-	 * starting background writeout, and then write out all the way down
-	 * to the lower threshold.  So slow writers cause minimal disk activity.
-	 *
-	 * In normal mode, we start background writeout at the lower
-	 * background_thresh, to keep the amount of dirty memory low.
-	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
@@ -638,7 +603,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);




* [PATCH 03/17] writeback: per-task rate limit to balance_dirty_pages()
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-per-task-dirty-count.patch --]
[-- Type: text/plain, Size: 3051 bytes --]

Try to limit the dirty throttle pause time to the range (10ms, 100ms)
by controlling how many pages are dirtied before doing a throttle pause.

The dirty count will be billed directly to the task struct. Slow start
and quick back-off are employed, so that the stable range will be biased
towards 10ms. Another intention is fine timing control for slow
devices, which may need to pause for 100ms even for a handful of pages.
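
A rough worked example (illustrative numbers, assuming 4KB pages and
HZ=250): with the pause formula from patch 02 and a gap-scaled bandwidth
of, say, 20MB/s, a task with nr_dirtied_pause = 64 pauses for
HZ * (64 << PAGE_CACHE_SHIFT) / bw = 250 * 262144 / 20971520 ~= 3 jiffies
(~12ms), inside the target range. If the pause grows beyond HZ/10 (100ms),
nr_dirtied_pause is roughly halved; if it falls below HZ/100 (~10ms), it
is bumped up by one page -- hence the bias towards 10ms.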

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++++++
 mm/page-writeback.c   |   31 ++++++++++++++-----------------
 2 files changed, 21 insertions(+), 17 deletions(-)

--- linux-next.orig/include/linux/sched.h	2010-09-12 13:10:54.000000000 +0800
+++ linux-next/include/linux/sched.h	2010-09-12 13:16:20.000000000 +0800
@@ -1455,6 +1455,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2010-09-12 13:10:54.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-12 13:12:48.000000000 +0800
@@ -529,6 +529,12 @@ static void balance_dirty_pages(struct a
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
 		pause = clamp_val(pause, 1, HZ/5);
 
+		if (pause > HZ/10) {
+			current->nr_dirtied_pause >>= 1;
+			current->nr_dirtied_pause++;
+		} else if (pause < HZ/100)
+			current->nr_dirtied_pause++;
+
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
@@ -570,8 +576,6 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -589,28 +593,21 @@ static DEFINE_PER_CPU(unsigned long, bdp
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied)
 {
-	unsigned long ratelimit;
-	unsigned long *p;
+	if (!current->nr_dirtied_pause)
+		current->nr_dirtied_pause =
+			mapping->backing_dev_info->dirty_exceeded ?
+			8 : ratelimit_pages;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	current->nr_dirtied += nr_pages_dirtied;
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
 	 * tasks in balance_dirty_pages(). Period.
 	 */
-	preempt_disable();
-	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = *p;
-		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	if (current->nr_dirtied >= current->nr_dirtied_pause) {
+		balance_dirty_pages(mapping, current->nr_dirtied);
+		current->nr_dirtied = 0;
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 04/17] writeback: quit throttling when bdi dirty/writeback pages go down
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-bdi-throttle-break.patch --]
[-- Type: text/plain, Size: 2269 bytes --]

Tests show that bdi_thresh may take minutes to ramp up on a typical
desktop. The ramp-up time should be improvable but cannot be eliminated
totally. So when (background_thresh + dirty_thresh)/2 is reached and
balance_dirty_pages() starts to throttle the task, it will suddenly find
the (still low and ramping up) bdi_thresh exceeded _excessively_. Here
we definitely don't want to stall the task for one minute. So introduce
an alternative way to break out of the loop when the bdi dirty/writeback
page count has dropped by a reasonable amount.

When dirty_background_ratio is set close to dirty_ratio, bdi_thresh may
also be constantly exceeded due to the task_dirty_limit() gap.

It will take at least 200ms before trying to break out.

(pages_dirtied * 8) is used because in this situation pages_dirtied will
typically be a small number (e.g. 3 pages) due to the fast back-off logic.
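
A worked illustration (assuming 4KB pages): when bdi_thresh is exceeded
excessively the gap is 0, so the pause from patch 02 clamps to its HZ/5
maximum of 200ms, and the first loop iteration only records
bdi_prev_dirty3 -- hence the "at least 200ms" above. With pages_dirtied
= 3, the loop then breaks as soon as the bdi dirty+writeback count has
dropped by more than 3 * 8 = 24 pages (~96KB) during one such pause,
provided the global dirty_thresh is not exceeded.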

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-09-09 15:51:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-12 13:10:02.000000000 +0800
@@ -463,6 +463,7 @@ static void balance_dirty_pages(struct a
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
+	long bdi_prev_dirty3 = 0;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -516,6 +517,20 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		/*
+		 * bdi_thresh could get exceeded for long time:
+		 * - bdi_thresh takes some time to ramp up from the initial 0
+		 * - users may set dirty_background_ratio close to dirty_ratio
+		 *   (at least 1/8 gap is preferred)
+		 * So offer a complementary way to break out of the loop when
+		 * enough bdi pages have been cleaned during our pause time.
+		 */
+		if (nr_reclaimable + nr_writeback <= dirty_thresh &&
+		    bdi_prev_dirty3 - (bdi_nr_reclaimable + bdi_nr_writeback) >
+							(long)pages_dirtied * 8)
+			break;
+		bdi_prev_dirty3 = bdi_nr_reclaimable + bdi_nr_writeback;
+
 		if (bdi_nr_reclaimable + bdi_nr_writeback <=
 			bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO)
 			goto check_exceeded;



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-break-on-signal-pending.patch --]
[-- Type: text/plain, Size: 604 bytes --]

This allows quick response to Ctrl-C etc. for impatient users.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-09-09 16:01:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
@@ -553,6 +553,9 @@ static void balance_dirty_pages(struct a
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
+		if (signal_pending(current))
+			break;
+
 check_exceeded:
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 06/17] writeback: move task dirty fraction to balance_dirty_pages()
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Peter Zijlstra, Wu Fengguang, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Jan Kara, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li Shaohua

[-- Attachment #1: writeback-task-weight.patch --]
[-- Type: text/plain, Size: 1765 bytes --]

This is a simple code refactor preparing for a trace event that exposes
the fraction info. It may eventually be merged with the next patch.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-09 16:02:30.000000000 +0800
@@ -260,14 +260,12 @@ static inline void task_dirties_fraction
  * effectively curb the growth of dirty pages. Light dirtiers with high enough
  * dirty threshold may never get throttled.
  */
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
+static unsigned long task_dirty_limit(unsigned long bdi_dirty,
+				      long numerator, long denominator)
 {
-	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
 	u64 inv = dirty / DIRTY_SOFT_THROTTLE_RATIO;
 
-	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
 	do_div(inv, denominator);
 
@@ -472,6 +470,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bw;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	long numerator, denominator;
 
 	for (;;) {
 		/*
@@ -496,8 +495,10 @@ static void balance_dirty_pages(struct a
 				(background_thresh + dirty_thresh) / 2)
 			break;
 
+		task_dirties_fraction(current, &numerator, &denominator);
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+		bdi_thresh = task_dirty_limit(bdi_thresh,
+					      numerator, denominator);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 07/17] writeback: add trace event for balance_dirty_pages()
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 5516 bytes --]

Here is an interesting test to verify the theory with balance_dirty_pages()
tracing. On a partition that can do ~60MB/s, a sparse file is created and
4 rsync tasks with different write bandwidth limits are started:

	dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
	echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable

	rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
	rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
	rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&

	(btw, is there a dd that can do --bwlimit?
	 it's a bit twisted to use rsync, or lftp net:limit-rate)

Trace outputs within 0.1 second, grouped by tasks:

rsync-3824  [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% thresh=130876 gap=5340 dirtied=192 pause=20 bw=34855512

rsync-3822  [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% thresh=130777 gap=5113 dirtied=192 pause=20 bw=33419085

rsync-3821  [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% thresh=129570 gap=3714 dirtied=64 pause=8 bw=24541625
rsync-3821  [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% thresh=129589 gap=3733 dirtied=64 pause=8 bw=24651878
rsync-3821  [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% thresh=129604 gap=3748 dirtied=64 pause=8 bw=24768628
rsync-3821  [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% thresh=129583 gap=3983 dirtied=64 pause=8 bw=26274370
rsync-3821  [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% thresh=129594 gap=3802 dirtied=64 pause=8 bw=25015422
rsync-3821  [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% thresh=129619 gap=3827 dirtied=64 pause=8 bw=25179342

rsync-3823  [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128762 gap=2842 dirtied=64 pause=12 bw=18885024
rsync-3823  [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128813 gap=3021 dirtied=64 pause=12 bw=20088241
rsync-3823  [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128805 gap=2885 dirtied=64 pause=12 bw=19146453
rsync-3823  [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128823 gap=2967 dirtied=64 pause=12 bw=19673334
rsync-3823  [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128821 gap=3221 dirtied=64 pause=12 bw=21362280
rsync-3823  [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128825 gap=3225 dirtied=64 pause=12 bw=21385010
rsync-3823  [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128782 gap=3214 dirtied=64 pause=12 bw=21333266
rsync-3823  [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% thresh=128764 gap=3036 dirtied=64 pause=12 bw=20104559

The data vividly show that
- the heaviest writer is throttled a bit (weight=39%)
- the lighter writers run at full speed (weight=15%, 15%, 30%);
  rsync is smart enough to compensate for the in-kernel pause time

Don't be confused by the 'bw=' field. It does not take user space
think time into account.
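
As a sanity check of the numbers (assuming 4KB pages and HZ=250): for the
first rsync-3821 line, pause = HZ * (dirtied << PAGE_CACHE_SHIFT) / bw
= 250 * 64 * 4096 / 24541625 ~= 2 jiffies, which the trace prints as
2 * 1000 / HZ = 8ms -- matching the pause=8 field above.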

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   47 +++++++++++++++++++++++++++--
 mm/page-writeback.c              |    5 +++
 2 files changed, 49 insertions(+), 3 deletions(-)

--- linux-next.orig/include/trace/events/writeback.h	2010-09-07 23:16:28.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-09-07 23:20:17.000000000 +0800
@@ -148,11 +148,52 @@ DEFINE_EVENT(wbc_class, name, \
 DEFINE_WBC_EVENT(wbc_writeback_start);
 DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long task_weight,
+		 unsigned long bdi_thresh,
+		 unsigned long gap,
+		 unsigned long bw,
+		 unsigned long pages_dirtied,
+		 unsigned long pause),
+
+	TP_ARGS(bdi, task_weight, bdi_thresh, gap, bw, pages_dirtied, pause),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(unsigned long,	task_weight)
+		__field(unsigned long,	bdi_thresh)
+		__field(unsigned long,	gap)
+		__field(unsigned long,	bw)
+		__field(unsigned long,	pages_dirtied)
+		__field(unsigned long,	pause)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+		__entry->task_weight	= task_weight;
+		__entry->bdi_thresh	= bdi_thresh;
+		__entry->gap		= gap;
+		__entry->bw		= bw;
+		__entry->pages_dirtied	= pages_dirtied;
+		__entry->pause		= pause * 1000 / HZ;
+	),
+
+	TP_printk("bdi=%s weight=%lu%% "
+		  "thresh=%lu gap=%lu dirtied=%lu pause=%lu bw=%lu",
+		  __entry->bdi,
+		  __entry->task_weight,
+		  __entry->bdi_thresh,
+		  __entry->gap,
+		  __entry->pages_dirtied,
+		  __entry->pause,
+		  __entry->bw
+		  )
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/mm/page-writeback.c	2010-09-07 23:20:02.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-07 23:21:58.000000000 +0800
@@ -536,6 +536,11 @@ static void balance_dirty_pages(struct a
 		} else if (pause < HZ/100)
 			current->nr_dirtied_pause++;
 
+		trace_balance_dirty_pages(bdi,
+					  100 * numerator / denominator,
+					  bdi_thresh, gap, bw,
+					  pages_dirtied, pause);
+
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 08/17] writeback: account per-bdi accumulated written pages
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2040 bytes --]

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.
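
BDI_WRITTEN counts pages whose writeback has completed on this bdi, so the
bandwidth can later be estimated roughly as
delta(BDI_WRITTEN) * PAGE_SIZE / elapsed time (see the next patch).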

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-09-09 15:39:25.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-09-09 16:02:43.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
--- linux-next.orig/mm/backing-dev.c	2010-09-09 15:39:25.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-09-09 16:02:43.000000000 +0800
@@ -91,6 +91,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -98,8 +99,9 @@ static int bdi_debug_stats_show(struct s
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
--- linux-next.orig/mm/page-writeback.c	2010-09-09 16:02:33.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-09 16:02:43.000000000 +0800
@@ -1305,6 +1305,7 @@ int test_clear_page_writeback(struct pag
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__inc_bdi_stat(bdi, BDI_WRITTEN);
 				__bdi_writeout_inc(bdi);
 			}
 		}



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 09/17] writeback: bdi write bandwidth estimation
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Li Shaohua, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig

[-- Attachment #1: writeback-bandwidth-estimation-in-flusher.patch --]
[-- Type: text/plain, Size: 5872 bytes --]

The estimated value starts at 100MB/s and adapts to the real
bandwidth within seconds.  It's pretty accurate for common filesystems.

As the first use case, it replaces the static 100MB/s value used for
the 'bw' calculation in balance_dirty_pages().

The overhead won't be high because the bdi bandwidth update only occurs
at >10ms intervals.

Initially it's only estimated in balance_dirty_pages() because this is
the most reliable place to get a reasonably large bandwidth -- the bdi is
normally fully utilized when bdi_thresh is reached.

Then Shaohua recommended also doing it in the flusher thread, to keep the
value updated when there is only periodic/background writeback and no
task is being throttled.

The estimation cannot be done purely in the flusher thread because that's
not sufficient for NFS. NFS writeback won't block at get_request_wait(),
so it tends to complete quickly. Another problem is that slow devices may
take dozens of seconds to write the initial 64MB chunk (write_bandwidth
starts at 100MB/s, which translates to a 64MB nr_to_write). So it may
take more than a minute to adapt to the small real bandwidth if the
value is only updated in the flusher thread.
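
A worked step of the moving average below (illustrative numbers):
w = clamp_t(unsigned long, time / (HZ/100), 1, 128) is roughly the elapsed
time in 10ms units, so a 100ms interval gives w ~= 10 and

	write_bandwidth = (write_bandwidth * (1024 - 10) + measured * 10) >> 10

i.e. each such update moves about 1% of the way from the old estimate
(initially 100MB/s) towards the measured value; with updates every
10-100ms during sustained writeback, the estimate converges within a few
seconds.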

CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |    4 ++++
 include/linux/backing-dev.h |    1 +
 include/linux/writeback.h   |    3 +++
 mm/backing-dev.c            |    1 +
 mm/page-writeback.c         |   33 ++++++++++++++++++++++++++++++++-
 5 files changed, 41 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-09-09 16:02:43.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-09-09 16:02:45.000000000 +0800
@@ -76,6 +76,7 @@ struct backing_dev_info {
 
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
+	int write_bandwidth;
 
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
--- linux-next.orig/mm/backing-dev.c	2010-09-09 16:02:43.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-09-09 16:02:45.000000000 +0800
@@ -658,6 +658,7 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
+	bdi->write_bandwidth = 100 << 20;
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
--- linux-next.orig/fs/fs-writeback.c	2010-09-09 14:13:21.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-09 16:02:46.000000000 +0800
@@ -603,6 +603,8 @@ static long wb_writeback(struct bdi_writ
 		.range_cyclic		= work->range_cyclic,
 	};
 	unsigned long oldest_jif;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 	long wrote = 0;
 	struct inode *inode;
 
@@ -616,6 +618,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}
 
+	bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
 		/*
@@ -641,6 +644,7 @@ static long wb_writeback(struct bdi_writ
 		else
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
+		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
 		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
--- linux-next.orig/mm/page-writeback.c	2010-09-09 16:02:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-09 16:04:23.000000000 +0800
@@ -449,6 +449,32 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written)
+{
+	unsigned long pages;
+	unsigned long time;
+	unsigned long bw;
+	unsigned long w;
+
+	if (*bw_written == 0)
+		goto start_over;
+
+	time = jiffies - *bw_time;
+	if (time < HZ/100)
+		return;
+
+	pages = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
+	bw = HZ * PAGE_CACHE_SIZE * pages / time;
+	w = clamp_t(unsigned long, time / (HZ/100), 1, 128);
+
+	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+start_over:
+	*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	*bw_time = jiffies;
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -471,6 +497,8 @@ static void balance_dirty_pages(struct a
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	long numerator, denominator;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 
 	for (;;) {
 		/*
@@ -536,10 +564,12 @@ static void balance_dirty_pages(struct a
 			bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO)
 			goto check_exceeded;
 
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
+
 		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
 		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
 
-		bw = (100 << 20) * gap /
+		bw = bdi->write_bandwidth * gap /
 				(bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
@@ -562,6 +592,7 @@ static void balance_dirty_pages(struct a
 		if (signal_pending(current))
 			break;
 
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 check_exceeded:
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/writeback.h	2010-09-09 15:51:38.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-09 16:02:46.000000000 +0800
@@ -136,6 +136,9 @@ int dirty_writeback_centisecs_handler(st
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 10/17] writeback: show bdi write bandwidth in debugfs
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Theodore Tso, Jan Kara, Peter Zijlstra, Wu Fengguang,
	Andrew Morton, Dave Chinner, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2034 bytes --]

Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.

btw increase digital field width to 10, for keeping the possibly
huge BdiWritten number aligned at least for desktop systems.

This will break user space tools if they are dumb enough to depend on
the number of white spaces.
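
A back-of-the-envelope check on the width (illustrative, assuming a
~60MB/s disk): BDI_WRITTEN is cumulative, and the old %8lu column stops
lining up past 99,999,999 kB (~95GB), which such a disk reaches in under
half an hour; a day of sustained writeback is already ~5 * 10^9 kB, i.e.
10 digits.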

CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/backing-dev.c	2010-09-11 08:42:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-09-12 11:13:49.000000000 +0800
@@ -86,21 +86,23 @@ static int bdi_debug_stats_show(struct s
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
-		   "BdiWriteback:     %8lu kB\n"
-		   "BdiReclaimable:   %8lu kB\n"
-		   "BdiDirtyThresh:   %8lu kB\n"
-		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n"
-		   "BdiWritten:       %8lu kB\n"
-		   "b_dirty:          %8lu\n"
-		   "b_io:             %8lu\n"
-		   "b_more_io:        %8lu\n"
-		   "bdi_list:         %8u\n"
-		   "state:            %8lx\n",
+		   "BdiWriteback:       %10lu kB\n"
+		   "BdiReclaimable:     %10lu kB\n"
+		   "BdiDirtyThresh:     %10lu kB\n"
+		   "DirtyThresh:        %10lu kB\n"
+		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiWritten:         %10lu kB\n"
+		   "BdiWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:            %10lu\n"
+		   "b_io:               %10lu\n"
+		   "b_more_io:          %10lu\n"
+		   "bdi_list:           %10u\n"
+		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
 		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) bdi->write_bandwidth >> 10,
 		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 10/17] writeback: show bdi write bandwidth in debugfs
@ 2010-09-12 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Theodore Tso, Jan Kara, Peter Zijlstra, Wu Fengguang,
	Andrew Morton, Dave Chinner, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2259 bytes --]

Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.

btw increase digital field width to 10, for keeping the possibly
huge BdiWritten number aligned at least for desktop systems.

This will break user space tools if they are dumb enough to depend on
the number of white spaces.

CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/backing-dev.c	2010-09-11 08:42:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-09-12 11:13:49.000000000 +0800
@@ -86,21 +86,23 @@ static int bdi_debug_stats_show(struct s
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
-		   "BdiWriteback:     %8lu kB\n"
-		   "BdiReclaimable:   %8lu kB\n"
-		   "BdiDirtyThresh:   %8lu kB\n"
-		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n"
-		   "BdiWritten:       %8lu kB\n"
-		   "b_dirty:          %8lu\n"
-		   "b_io:             %8lu\n"
-		   "b_more_io:        %8lu\n"
-		   "bdi_list:         %8u\n"
-		   "state:            %8lx\n",
+		   "BdiWriteback:       %10lu kB\n"
+		   "BdiReclaimable:     %10lu kB\n"
+		   "BdiDirtyThresh:     %10lu kB\n"
+		   "DirtyThresh:        %10lu kB\n"
+		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiWritten:         %10lu kB\n"
+		   "BdiWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:            %10lu\n"
+		   "b_io:               %10lu\n"
+		   "b_more_io:          %10lu\n"
+		   "bdi_list:           %10u\n"
+		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
 		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) bdi->write_bandwidth >> 10,
 		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 11/17] writeback: make nr_to_write a per-file limit
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 1846 bytes --]

This ensures full 4MB (or larger) writeback size for large dirty files.
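
As a worked illustration (hypothetical numbers): with a shared budget of
1024 pages, if the first inode in the b_io queue writes 1000 pages, the
next inode would previously be asked for only the remaining 24 pages.
With wbc->per_file_limit, writeback_single_inode() temporarily gives
every inode the full per-file budget (MAX_WRITEBACK_PAGES = 1024 pages,
i.e. 4MB with 4KB pages) around do_writepages(), and afterwards charges
only the pages actually written against the outer nr_to_write budget.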

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   11 +++++++++++
 include/linux/writeback.h |    1 +
 2 files changed, 12 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-09-08 13:50:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-08 13:50:35.000000000 +0800
@@ -304,6 +304,8 @@ static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long per_file_limit = wbc->per_file_limit;
+	long nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -339,8 +341,16 @@ writeback_single_inode(struct inode *ino
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
+	if (per_file_limit) {
+		nr_to_write = wbc->nr_to_write;
+		wbc->nr_to_write = per_file_limit;
+	}
+
 	ret = do_writepages(mapping, wbc);
 
+	if (per_file_limit)
+		wbc->nr_to_write += nr_to_write - per_file_limit;
+
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
@@ -635,6 +645,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
+		wbc.per_file_limit = MAX_WRITEBACK_PAGES;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
 
--- linux-next.orig/include/linux/writeback.h	2010-09-08 13:50:32.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-08 13:50:35.000000000 +0800
@@ -44,6 +44,7 @@ struct writeback_control {
 					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	long per_file_limit;		/* Write this many pages for one file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 11/17] writeback: make nr_to_write a per-file limit
@ 2010-09-12 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 2071 bytes --]

This ensures full 4MB (or larger) writeback size for large dirty files.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   11 +++++++++++
 include/linux/writeback.h |    1 +
 2 files changed, 12 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-09-08 13:50:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-08 13:50:35.000000000 +0800
@@ -304,6 +304,8 @@ static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long per_file_limit = wbc->per_file_limit;
+	long nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -339,8 +341,16 @@ writeback_single_inode(struct inode *ino
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
+	if (per_file_limit) {
+		nr_to_write = wbc->nr_to_write;
+		wbc->nr_to_write = per_file_limit;
+	}
+
 	ret = do_writepages(mapping, wbc);
 
+	if (per_file_limit)
+		wbc->nr_to_write += nr_to_write - per_file_limit;
+
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
@@ -635,6 +645,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
+		wbc.per_file_limit = MAX_WRITEBACK_PAGES;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
 
--- linux-next.orig/include/linux/writeback.h	2010-09-08 13:50:32.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-08 13:50:35.000000000 +0800
@@ -44,6 +44,7 @@ struct writeback_control {
 					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	long per_file_limit;		/* Write this many pages for one file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 12/17] writeback: scale IO chunk size up to device bandwidth
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Theodore Tso, Dave Chinner, Chris Mason, Peter Zijlstra,
	Wu Fengguang, Andrew Morton, Jan Kara, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 3755 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4, as it caused large files to
become interleaved on disk in 8 megabyte chunks (ext4 bumped up
the nr_to_write by a factor of two).

So remove the MAX_WRITEBACK_PAGES constraint entirely. The writeback
chunk size will adapt to as much as the storage device can write within
1 second.

For a typical hard disk, the resulting chunk size will be 32MB or 64MB.
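
As a rough worked example (assuming 4KB pages, i.e. PAGE_CACHE_SHIFT of
12): for a measured write bandwidth of 60MB/s, bdi_writeback_chunk_size()
yields 60MB/4KB = 15360 pages, which rounddown_pow_of_two() turns into
8192 pages, i.e. a 32MB chunk; a 100MB/s disk gets 25600 pages, rounded
down to 16384 pages, i.e. 64MB. The max() against 4<<20 keeps the chunk
size from ever dropping below 1024 pages (4MB) for very slow or
not-yet-estimated devices.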

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-09-07 23:26:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-07 23:26:37.000000000 +0800
@@ -568,15 +568,6 @@ static void __writeback_inodes_sb(struct
 	spin_unlock(&inode_lock);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -588,6 +579,18 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long bdi_writeback_chunk_size(struct backing_dev_info *bdi)
+{
+	unsigned long pages;
+
+	pages = max(bdi->write_bandwidth, 4 << 20) >> PAGE_CACHE_SHIFT;
+
+	return rounddown_pow_of_two(pages);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -645,8 +648,8 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
-		wbc.per_file_limit = MAX_WRITEBACK_PAGES;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.per_file_limit = bdi_writeback_chunk_size(wb->bdi);
+		wbc.nr_to_write = wbc.per_file_limit;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -657,8 +660,8 @@ static long wb_writeback(struct bdi_writ
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages	-= wbc.per_file_limit - wbc.nr_to_write;
+		wrote		+= wbc.per_file_limit - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -673,7 +676,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < wbc.per_file_limit)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 12/17] writeback: scale IO chunk size up to device bandwidth
@ 2010-09-12 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Theodore Tso, Dave Chinner, Chris Mason, Peter Zijlstra,
	Wu Fengguang, Andrew Morton, Jan Kara, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 3980 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4, as it caused large files to
become interleaved on disk in 8 megabyte chunks (ext4 bumped up
the nr_to_write by a factor of two).

So remove the MAX_WRITEBACK_PAGES constraint entirely. The writeback
chunk size will adapt to as much as the storage device can write within
1 second.

For a typical hard disk, the resulting chunk size will be 32MB or 64MB.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-09-07 23:26:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-07 23:26:37.000000000 +0800
@@ -568,15 +568,6 @@ static void __writeback_inodes_sb(struct
 	spin_unlock(&inode_lock);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -588,6 +579,18 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long bdi_writeback_chunk_size(struct backing_dev_info *bdi)
+{
+	unsigned long pages;
+
+	pages = max(bdi->write_bandwidth, 4 << 20) >> PAGE_CACHE_SHIFT;
+
+	return rounddown_pow_of_two(pages);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -645,8 +648,8 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
-		wbc.per_file_limit = MAX_WRITEBACK_PAGES;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.per_file_limit = bdi_writeback_chunk_size(wb->bdi);
+		wbc.nr_to_write = wbc.per_file_limit;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -657,8 +660,8 @@ static long wb_writeback(struct bdi_writ
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages	-= wbc.per_file_limit - wbc.nr_to_write;
+		wrote		+= wbc.per_file_limit - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -673,7 +676,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < wbc.per_file_limit)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 13/17] writeback: reduce per-bdi dirty threshold ramp up time
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Peter Zijlstra, Richard Kennedy, Martin J. Bligh,
	Wu Fengguang, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Mel Gorman, Rik van Riel, KOSAKI Motohiro, Chris Mason,
	Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1555 bytes --]

Reduce the dampening for the control system, yielding faster
convergence.

Currently it converges at a snail's pace for slow devices (on the order
of minutes).  For really fast storage, the convergence speed should be fine.

It makes sense to make it reasonably fast for typical desktops.

After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
16GB mem, which looks good.
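
The change amounts to lowering the shift returned by calc_period_shift()
by 3 (from 2 + ilog2(dirty_total - 1) to ilog2(dirty_total - 1) - 1), so
the dampening period of the per-bdi dirty proportions shrinks by a factor
of 2^3 = 8, i.e. the per-bdi threshold adapts roughly 8 times faster for
the same workload.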

$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh:            0 kB
BdiDirtyThresh:       118748 kB
BdiDirtyThresh:       214280 kB
BdiDirtyThresh:       303868 kB
BdiDirtyThresh:       376528 kB
BdiDirtyThresh:       411180 kB
BdiDirtyThresh:       448636 kB
BdiDirtyThresh:       472260 kB
BdiDirtyThresh:       490924 kB
BdiDirtyThresh:       499596 kB
BdiDirtyThresh:       507068 kB
...
DirtyThresh:          530392 kB

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
CC: Martin J. Bligh <mbligh@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-30 10:24:00.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-30 10:27:10.000000000 +0800
@@ -131,7 +131,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1) - 1;
 }
 
 /*



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 13/17] writeback: reduce per-bdi dirty threshold ramp up time
@ 2010-09-12 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Peter Zijlstra, Richard Kennedy, Martin J. Bligh,
	Wu Fengguang, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Mel Gorman, Rik van Riel, KOSAKI Motohiro, Chris Mason,
	Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1780 bytes --]

Reduce the dampening for the control system, yielding faster
convergence.

Currently it converges at a snail's pace for slow devices (on the order
of minutes).  For really fast storage, the convergence speed should be fine.

It makes sense to make it reasonably fast for typical desktops.

After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
16GB mem, which looks good.

$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh:            0 kB
BdiDirtyThresh:       118748 kB
BdiDirtyThresh:       214280 kB
BdiDirtyThresh:       303868 kB
BdiDirtyThresh:       376528 kB
BdiDirtyThresh:       411180 kB
BdiDirtyThresh:       448636 kB
BdiDirtyThresh:       472260 kB
BdiDirtyThresh:       490924 kB
BdiDirtyThresh:       499596 kB
BdiDirtyThresh:       507068 kB
...
DirtyThresh:          530392 kB

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
CC: Martin J. Bligh <mbligh@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-30 10:24:00.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-30 10:27:10.000000000 +0800
@@ -131,7 +131,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1) - 1;
 }
 
 /*



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 14/17] vmscan: add scan_control.priority
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:49   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: mm-sc-priority.patch --]
[-- Type: text/plain, Size: 1164 bytes --]

It seems most vmscan functions need the priority parameter.
It will simplify the code to put it into scan_control.

It will be referenced in the next patch. This patch could convert
the many existing functions, but let's keep it simple at first.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    4 ++++
 1 file changed, 4 insertions(+)

--- linux-next.orig/mm/vmscan.c	2010-09-10 13:13:41.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-09-10 13:17:01.000000000 +0800
@@ -78,6 +78,8 @@ struct scan_control {
 
 	int order;
 
+	int priority;
+
 	/*
 	 * Intend to reclaim enough continuous memory rather than reclaim
 	 * enough amount of memory. i.e, mode for high order allocation.
@@ -1875,6 +1877,7 @@ static unsigned long do_try_to_free_page
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
+		sc->priority = priority;
 		if (!priority)
 			disable_swap_token();
 		all_unreclaimable = shrink_zones(priority, zonelist, sc);
@@ -2127,6 +2130,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		sc.priority = priority;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 14/17] vmscan: add scan_control.priority
@ 2010-09-12 15:49   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:49 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: mm-sc-priority.patch --]
[-- Type: text/plain, Size: 1389 bytes --]

It seems most vmscan functions need the priority parameter.
It will simplify the code to put it into scan_control.

It will be referenced in the next patch. This patch could convert
the many existing functions, but let's keep it simple at first.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    4 ++++
 1 file changed, 4 insertions(+)

--- linux-next.orig/mm/vmscan.c	2010-09-10 13:13:41.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-09-10 13:17:01.000000000 +0800
@@ -78,6 +78,8 @@ struct scan_control {
 
 	int order;
 
+	int priority;
+
 	/*
 	 * Intend to reclaim enough continuous memory rather than reclaim
 	 * enough amount of memory. i.e, mode for high order allocation.
@@ -1875,6 +1877,7 @@ static unsigned long do_try_to_free_page
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
+		sc->priority = priority;
 		if (!priority)
 			disable_swap_token();
 		all_unreclaimable = shrink_zones(priority, zonelist, sc);
@@ -2127,6 +2130,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		sc.priority = priority;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 15/17] mm: lower soft dirty limits on memory pressure
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:50   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Dave Chinner, Wu Fengguang, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li Shaohua

[-- Attachment #1: mm-dynamic-dirty-throttle.patch --]
[-- Type: text/plain, Size: 7643 bytes --]

When memory pressure increases, the LRU lists will be scanned faster and
hence are more likely to hit dirty pages and trigger undesirable pageout()s.

Avoiding pageout() sidesteps a good number of problems with, e.g., IO
efficiency, responsiveness, vmscan efficiency, etc.

Introduce vm_dirty_pressure to keep track of the vmscan pressure from
the dirty pageout point of view. It ranges from VM_DIRTY_PRESSURE down
to 0; the lower the value, the more pageout() pressure.

The adaption rules are basically "fast down, slow up".

- when dirty pages are encountered during vmscan, vm_dirty_pressure will
  be instantly lowered to
  - VM_DIRTY_PRESSURE/2 for priority=DEF_PRIORITY
  - VM_DIRTY_PRESSURE/4 for priority=DEF_PRIORITY-1
  ...
  - 0 for priority=3

- whenever kswapd (of the most pressured node) goes idle, add 1 to
  vm_dirty_pressure. If that node stays idle, its kswapd will wake up
  every second to increase vm_dirty_pressure over time.
  
The vm_dirty_pressure_node trick can avoid it being increased too fast
on large NUMA systems. On the other hand, it may still be decreased too
much when only one node is under pressure on a large NUMA system.
(XXX: easy ways to detect that?)

The above heuristics will keep vm_dirty_pressure near 512 during a
simple write test: cp /dev/zero /tmp/. The test box has 4GB memory.

The ratio (vm_dirty_pressure : VM_DIRTY_PRESSURE) is applied directly
as a multiplier to the _soft_ dirty limits (a small worked example
follows the list below).

- it avoids abrupt changes in the applications' progress speed

- it also tries to keep the bdi dirty throttle limit above 1 second's
  worth of dirty pages, to avoid hurting IO efficiency

- the background dirty threshold can reach 0, so that when there are no
  heavy dirtiers, all dirty pages can be cleaned
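
A small worked example of the scaling (made-up numbers): with
background_thresh = 25000 pages, dirty_thresh = 50000 pages and
vm_dirty_pressure = 512 (half of VM_DIRTY_PRESSURE), the global soft
throttle point (background_thresh + dirty_thresh) / 2 drops from 37500
to 18750 pages, and the background threshold tested by
over_bground_thresh() drops from 25000 to 12500 pages; at
vm_dirty_pressure = 0 the background threshold reaches 0, so the flusher
keeps running until all dirty pages are cleaned. Independently, the
per-bdi soft limit is never scaled below bdi->write_bandwidth worth of
pages (e.g. 15360 pages for a 60MB/s device with 4KB pages), per the
second bullet above.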

Simply lowering the dirty limits may not immediately knock down the
number of dirty pages (there are still good chances that the flusher
thread is running or will run soon).  Waking up the flusher thread will
be handled in further patches -- maybe revised versions of

	http://lkml.org/lkml/2010/7/29/191
	http://lkml.org/lkml/2010/7/29/189

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    3 ++
 include/linux/writeback.h |    4 +++
 mm/page-writeback.c       |   38 +++++++++++++++++++++++++++++-------
 mm/vmscan.c               |   18 ++++++++++++++++-
 4 files changed, 55 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-09-11 15:34:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-11 15:35:03.000000000 +0800
@@ -574,6 +574,9 @@ static inline bool over_bground_thresh(v
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
+	background_thresh = background_thresh *
+					vm_dirty_pressure / VM_DIRTY_PRESSURE;
+
 	return (global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
 }
--- linux-next.orig/include/linux/writeback.h	2010-09-11 15:34:37.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-11 15:35:01.000000000 +0800
@@ -22,6 +22,8 @@ extern struct list_head inode_unused;
  */
 #define DIRTY_SOFT_THROTTLE_RATIO	16
 
+#define VM_DIRTY_PRESSURE		(1 << 10)
+
 /*
  * fs/fs-writeback.c
  */
@@ -107,6 +109,8 @@ void throttle_vm_writeout(gfp_t gfp_mask
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern unsigned long dirty_background_bytes;
+extern int vm_dirty_pressure;
+extern int vm_dirty_pressure_node;
 extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
--- linux-next.orig/mm/page-writeback.c	2010-09-11 15:34:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-11 15:35:01.000000000 +0800
@@ -62,6 +62,14 @@ unsigned long dirty_background_bytes;
 int vm_highmem_is_dirtyable;
 
 /*
+ * The vm_dirty_pressure:VM_DIRTY_PRESSURE ratio is used to lower the soft
+ * dirty throttle limits under memory pressure, so as to reduce the number of
+ * dirty pages and hence undesirable pageout() calls in page reclaim.
+ */
+int vm_dirty_pressure = VM_DIRTY_PRESSURE;
+int vm_dirty_pressure_node;
+
+/*
  * The generator of dirty data starts writeback at this percentage
  */
 int vm_dirty_ratio = 20;
@@ -491,6 +499,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long thresh;
 	unsigned long pause;
 	unsigned long gap;
 	unsigned long bw;
@@ -519,8 +528,9 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		thresh = (background_thresh + dirty_thresh) / 2;
+		thresh = thresh * vm_dirty_pressure / VM_DIRTY_PRESSURE;
+		if (nr_reclaimable + nr_writeback <= thresh)
 			break;
 
 		task_dirties_fraction(current, &numerator, &denominator);
@@ -560,8 +570,22 @@ static void balance_dirty_pages(struct a
 			break;
 		bdi_prev_total = bdi_nr_reclaimable + bdi_nr_writeback;
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <=
-			bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO)
+
+		thresh = bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO;
+		/*
+		 * Lower the soft throttle thresh according to dirty pressure,
+		 * but keep a minimal pool of dirty pages that can be written
+		 * within 1 second to prevent hurting IO performance.
+		 */
+		if (vm_dirty_pressure < VM_DIRTY_PRESSURE) {
+			int dp = vm_dirty_pressure;
+			bw = bdi->write_bandwidth >> PAGE_CACHE_SHIFT;
+			if (thresh * dp / VM_DIRTY_PRESSURE > bw)
+				thresh = thresh * dp / VM_DIRTY_PRESSURE;
+			else if (thresh > bw)
+				thresh = bw;
+		}
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= thresh)
 			goto check_exceeded;
 
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
@@ -569,8 +593,7 @@ static void balance_dirty_pages(struct a
 		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
 		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
 
-		bw = bdi->write_bandwidth * gap /
-				(bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO + 1);
+		bw = bdi->write_bandwidth * gap / (bdi_thresh - thresh + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
 		pause = clamp_val(pause, 1, HZ/5);
@@ -617,7 +640,8 @@ check_exceeded:
 	if (writeback_in_progress(bdi))
 		return;
 
-	if (nr_reclaimable > background_thresh)
+	if (nr_reclaimable > background_thresh *
+					vm_dirty_pressure / VM_DIRTY_PRESSURE)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/mm/vmscan.c	2010-09-11 15:34:39.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-09-11 15:35:01.000000000 +0800
@@ -745,6 +745,16 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
+
+			if (file && scanning_global_lru(sc)) {
+				int dp = VM_DIRTY_PRESSURE >>
+					(DEF_PRIORITY + 1 - sc->priority);
+				if (vm_dirty_pressure > dp) {
+					vm_dirty_pressure = dp;
+					vm_dirty_pressure_node = numa_node_id();
+				}
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -2354,8 +2364,14 @@ static int kswapd(void *p)
 				 * to sleep until explicitly woken up
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					int dp = vm_dirty_pressure;
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
-					schedule();
+					if (dp < VM_DIRTY_PRESSURE &&
+					    vm_dirty_pressure_node == numa_node_id()) {
+						vm_dirty_pressure = dp + 1;
+						schedule_timeout(HZ);
+					} else
+						schedule();
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 15/17] mm: lower soft dirty limits on memory pressure
@ 2010-09-12 15:50   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Dave Chinner, Wu Fengguang, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li Shaohua

[-- Attachment #1: mm-dynamic-dirty-throttle.patch --]
[-- Type: text/plain, Size: 7868 bytes --]

When memory pressure increases, the LRU lists will be scanned faster and
hence are more likely to hit dirty pages and trigger undesirable pageout()s.

Avoiding pageout() sidesteps a good number of problems with, e.g., IO
efficiency, responsiveness, vmscan efficiency, etc.

Introduce vm_dirty_pressure to keep track of the vmscan pressure from
the dirty pageout point of view. It ranges from VM_DIRTY_PRESSURE down
to 0; the lower the value, the more pageout() pressure.

The adaption rules are basically "fast down, slow up".

- when dirty pages are encountered during vmscan, vm_dirty_pressure will
  be instantly lowered to
  - VM_DIRTY_PRESSURE/2 for priority=DEF_PRIORITY
  - VM_DIRTY_PRESSURE/4 for priority=DEF_PRIORITY-1
  ...
  - 0 for priority=3

- whenever kswapd (of the most pressured node) goes idle, add 1 to
  vm_dirty_pressure. If that node stays idle, its kswapd will wake up
  every second to increase vm_dirty_pressure over time.
  
The vm_dirty_pressure_node trick can avoid it being increased too fast
on large NUMA systems. On the other hand, it may still be decreased too
much when only one node is under pressure on a large NUMA system.
(XXX: easy ways to detect that?)

The above heuristics will keep vm_dirty_pressure near 512 during a
simple write test: cp /dev/zero /tmp/. The test box has 4GB memory.

The ratio (vm_dirty_pressure : VM_DIRTY_PRESSURE) is applied directly
as a multiplier to the _soft_ dirty limits.

- it avoids abrupt changes in the applications' progress speed

- it also tries to keep the bdi dirty throttle limit above 1 second's
  worth of dirty pages, to avoid hurting IO efficiency

- the background dirty threshold can reach 0, so that when there are no
  heavy dirtiers, all dirty pages can be cleaned

Simply lowering the dirty limits may not immediately knock down the
number of dirty pages (there are still good chances that the flusher
thread is running or will run soon).  Waking up the flusher thread will
be handled in further patches -- maybe revised versions of

	http://lkml.org/lkml/2010/7/29/191
	http://lkml.org/lkml/2010/7/29/189

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    3 ++
 include/linux/writeback.h |    4 +++
 mm/page-writeback.c       |   38 +++++++++++++++++++++++++++++-------
 mm/vmscan.c               |   18 ++++++++++++++++-
 4 files changed, 55 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-09-11 15:34:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-09-11 15:35:03.000000000 +0800
@@ -574,6 +574,9 @@ static inline bool over_bground_thresh(v
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
+	background_thresh = background_thresh *
+					vm_dirty_pressure / VM_DIRTY_PRESSURE;
+
 	return (global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
 }
--- linux-next.orig/include/linux/writeback.h	2010-09-11 15:34:37.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-09-11 15:35:01.000000000 +0800
@@ -22,6 +22,8 @@ extern struct list_head inode_unused;
  */
 #define DIRTY_SOFT_THROTTLE_RATIO	16
 
+#define VM_DIRTY_PRESSURE		(1 << 10)
+
 /*
  * fs/fs-writeback.c
  */
@@ -107,6 +109,8 @@ void throttle_vm_writeout(gfp_t gfp_mask
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern unsigned long dirty_background_bytes;
+extern int vm_dirty_pressure;
+extern int vm_dirty_pressure_node;
 extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
--- linux-next.orig/mm/page-writeback.c	2010-09-11 15:34:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-11 15:35:01.000000000 +0800
@@ -62,6 +62,14 @@ unsigned long dirty_background_bytes;
 int vm_highmem_is_dirtyable;
 
 /*
+ * The vm_dirty_pressure:VM_DIRTY_PRESSURE ratio is used to lower the soft
+ * dirty throttle limits under memory pressure, so as to reduce the number of
+ * dirty pages and hence undesirable pageout() calls in page reclaim.
+ */
+int vm_dirty_pressure = VM_DIRTY_PRESSURE;
+int vm_dirty_pressure_node;
+
+/*
  * The generator of dirty data starts writeback at this percentage
  */
 int vm_dirty_ratio = 20;
@@ -491,6 +499,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long thresh;
 	unsigned long pause;
 	unsigned long gap;
 	unsigned long bw;
@@ -519,8 +528,9 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		thresh = (background_thresh + dirty_thresh) / 2;
+		thresh = thresh * vm_dirty_pressure / VM_DIRTY_PRESSURE;
+		if (nr_reclaimable + nr_writeback <= thresh)
 			break;
 
 		task_dirties_fraction(current, &numerator, &denominator);
@@ -560,8 +570,22 @@ static void balance_dirty_pages(struct a
 			break;
 		bdi_prev_total = bdi_nr_reclaimable + bdi_nr_writeback;
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <=
-			bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO)
+
+		thresh = bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO;
+		/*
+		 * Lower the soft throttle thresh according to dirty pressure,
+		 * but keep a minimal pool of dirty pages that can be written
+		 * within 1 second to prevent hurting IO performance.
+		 */
+		if (vm_dirty_pressure < VM_DIRTY_PRESSURE) {
+			int dp = vm_dirty_pressure;
+			bw = bdi->write_bandwidth >> PAGE_CACHE_SHIFT;
+			if (thresh * dp / VM_DIRTY_PRESSURE > bw)
+				thresh = thresh * dp / VM_DIRTY_PRESSURE;
+			else if (thresh > bw)
+				thresh = bw;
+		}
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= thresh)
 			goto check_exceeded;
 
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
@@ -569,8 +593,7 @@ static void balance_dirty_pages(struct a
 		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
 		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
 
-		bw = bdi->write_bandwidth * gap /
-				(bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO + 1);
+		bw = bdi->write_bandwidth * gap / (bdi_thresh - thresh + 1);
 
 		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
 		pause = clamp_val(pause, 1, HZ/5);
@@ -617,7 +640,8 @@ check_exceeded:
 	if (writeback_in_progress(bdi))
 		return;
 
-	if (nr_reclaimable > background_thresh)
+	if (nr_reclaimable > background_thresh *
+					vm_dirty_pressure / VM_DIRTY_PRESSURE)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/mm/vmscan.c	2010-09-11 15:34:39.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-09-11 15:35:01.000000000 +0800
@@ -745,6 +745,16 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
+
+			if (file && scanning_global_lru(sc)) {
+				int dp = VM_DIRTY_PRESSURE >>
+					(DEF_PRIORITY + 1 - sc->priority);
+				if (vm_dirty_pressure > dp) {
+					vm_dirty_pressure = dp;
+					vm_dirty_pressure_node = numa_node_id();
+				}
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -2354,8 +2364,14 @@ static int kswapd(void *p)
 				 * to sleep until explicitly woken up
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					int dp = vm_dirty_pressure;
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
-					schedule();
+					if (dp < VM_DIRTY_PRESSURE &&
+					    vm_dirty_pressure_node == numa_node_id()) {
+						vm_dirty_pressure = dp + 1;
+						schedule_timeout(HZ);
+					} else
+						schedule();
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 16/17] mm: create /vm/dirty_pressure in debugfs
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:50   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: mm-debugfs-dirty-pressure.patch --]
[-- Type: text/plain, Size: 1415 bytes --]

Create /debug/vm/ -- a convenient place for kernel hackers to play with
VM variables.

The first variable exported is vm_dirty_pressure, used for avoiding
excessive pageout()s. It ranges from 0 to 1024; the lower the value,
the lower the dirty limits.
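
A quick illustrative session, assuming debugfs is mounted at /debug as
elsewhere in this series:

# cat /debug/vm/dirty_pressure
1024
# echo 256 > /debug/vm/dirty_pressure

which would scale the soft dirty limits down to 256/1024 = 1/4 of their
normal values, until vmscan/kswapd adjusts vm_dirty_pressure again.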

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmstat.c |   29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/vmstat.c	2010-09-12 09:50:57.000000000 +0800
+++ linux-next/mm/vmstat.c	2010-09-12 13:27:44.000000000 +0800
@@ -1045,9 +1045,33 @@ static int __init setup_vmstat(void)
 }
 module_init(setup_vmstat)
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
+#if defined(CONFIG_DEBUG_FS)
 #include <linux/debugfs.h>
+#include <linux/writeback.h>
 
+static struct dentry *vm_debug_root;
+
+static int __init vm_debug_init(void)
+{
+	struct dentry *dentry;
+
+	vm_debug_root = debugfs_create_dir("vm", NULL);
+	if (!vm_debug_root)
+		goto fail;
+
+	dentry = debugfs_create_u32("dirty_pressure", 0644,
+				    vm_debug_root, &vm_dirty_pressure);
+	if (!dentry)
+		goto fail;
+
+	return 0;
+fail:
+	return -ENOMEM;
+}
+
+module_init(vm_debug_init);
+
+#if defined(CONFIG_COMPACTION)
 static struct dentry *extfrag_debug_root;
 
 /*
@@ -1202,4 +1226,5 @@ static int __init extfrag_debug_init(voi
 }
 
 module_init(extfrag_debug_init);
-#endif
+#endif /* CONFIG_COMPACTION */
+#endif /* CONFIG_DEBUG_FS */



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 16/17] mm: create /vm/dirty_pressure in debugfs
@ 2010-09-12 15:50   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: mm-debugfs-dirty-pressure.patch --]
[-- Type: text/plain, Size: 1640 bytes --]

Create /debug/vm/ -- a convenient place for kernel hackers to play with
VM variables.

The first variable exported is vm_dirty_pressure, used for avoiding
excessive pageout()s. It ranges from 0 to 1024; the lower the value,
the lower the dirty limits.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmstat.c |   29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/vmstat.c	2010-09-12 09:50:57.000000000 +0800
+++ linux-next/mm/vmstat.c	2010-09-12 13:27:44.000000000 +0800
@@ -1045,9 +1045,33 @@ static int __init setup_vmstat(void)
 }
 module_init(setup_vmstat)
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
+#if defined(CONFIG_DEBUG_FS)
 #include <linux/debugfs.h>
+#include <linux/writeback.h>
 
+static struct dentry *vm_debug_root;
+
+static int __init vm_debug_init(void)
+{
+	struct dentry *dentry;
+
+	vm_debug_root = debugfs_create_dir("vm", NULL);
+	if (!vm_debug_root)
+		goto fail;
+
+	dentry = debugfs_create_u32("dirty_pressure", 0644,
+				    vm_debug_root, &vm_dirty_pressure);
+	if (!dentry)
+		goto fail;
+
+	return 0;
+fail:
+	return -ENOMEM;
+}
+
+module_init(vm_debug_init);
+
+#if defined(CONFIG_COMPACTION)
 static struct dentry *extfrag_debug_root;
 
 /*
@@ -1202,4 +1226,5 @@ static int __init extfrag_debug_init(voi
 }
 
 module_init(extfrag_debug_init);
-#endif
+#endif /* CONFIG_COMPACTION */
+#endif /* CONFIG_DEBUG_FS */



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 17/17] writeback: consolidate balance_dirty_pages() variable names
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-09-12 15:50   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 3711 bytes --]

Lots of lengthy tests.. Let's compact the names:

	*_dirty3 = dirty + writeback + unstable

balance_dirty_pages() only cares about the above dirty sum except
in one place -- on starting background writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-09-12 13:30:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-12 13:34:04.000000000 +0800
@@ -493,8 +493,8 @@ start_over:
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long pages_dirtied)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	long nr_reclaimable;
+	long nr_dirty3, bdi_dirty3;
 	long bdi_prev_dirty3 = 0;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
@@ -518,7 +518,7 @@ static void balance_dirty_pages(struct a
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty3 = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -529,7 +529,7 @@ static void balance_dirty_pages(struct a
 		 */
 		thresh = (background_thresh + dirty_thresh) / 2;
 		thresh = thresh * vm_dirty_pressure / VM_DIRTY_PRESSURE;
-		if (nr_reclaimable + nr_writeback <= thresh)
+		if (nr_dirty3 <= thresh)
 			break;
 
 		task_dirties_fraction(current, &numerator, &denominator);
@@ -548,11 +548,11 @@ static void balance_dirty_pages(struct a
 		 * deltas.
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty3 = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+				     bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty3 = bdi_stat(bdi, BDI_RECLAIMABLE) +
+				     bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
 		/*
@@ -563,11 +563,10 @@ static void balance_dirty_pages(struct a
 		 * So offer a complementary way to break out of the loop when
 		 * enough bdi pages have been cleaned during our pause time.
 		 */
-		if (nr_reclaimable + nr_writeback <= dirty_thresh &&
-		    bdi_prev_dirty3 - (bdi_nr_reclaimable + bdi_nr_writeback) >
-							(long)pages_dirtied * 8)
+		if (nr_dirty3 <= dirty_thresh &&
+		    bdi_prev_dirty3 - bdi_dirty3 > (long)pages_dirtied * 8)
 			break;
-		bdi_prev_dirty3 = bdi_nr_reclaimable + bdi_nr_writeback;
+		bdi_prev_dirty3 = bdi_dirty3;
 
 
 		thresh = bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO;
@@ -584,13 +583,13 @@ static void balance_dirty_pages(struct a
 			else if (thresh > bw)
 				thresh = bw;
 		}
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= thresh)
+		if (bdi_dirty3 <= thresh)
 			goto check_exceeded;
 
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
-		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
-		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
+		gap = bdi_thresh > bdi_dirty3 ?
+		      bdi_thresh - bdi_dirty3 : 0;
 
 		bw = bdi->write_bandwidth * gap / (bdi_thresh - thresh + 1);
 
@@ -622,9 +621,8 @@ check_exceeded:
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty3 > bdi_thresh) ||
+				  (nr_dirty3 > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 17/17] writeback: consolidate balance_dirty_pages() variable names
@ 2010-09-12 15:50   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:50 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Wu Fengguang, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 3936 bytes --]

Lots of lengthy tests.. Let's compact the names:

	*_dirty3 = dirty + writeback + unstable

balance_dirty_pages() only cares about the above dirty sum except
in one place -- on starting background writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-09-12 13:30:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-12 13:34:04.000000000 +0800
@@ -493,8 +493,8 @@ start_over:
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long pages_dirtied)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	long nr_reclaimable;
+	long nr_dirty3, bdi_dirty3;
 	long bdi_prev_dirty3 = 0;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
@@ -518,7 +518,7 @@ static void balance_dirty_pages(struct a
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty3 = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -529,7 +529,7 @@ static void balance_dirty_pages(struct a
 		 */
 		thresh = (background_thresh + dirty_thresh) / 2;
 		thresh = thresh * vm_dirty_pressure / VM_DIRTY_PRESSURE;
-		if (nr_reclaimable + nr_writeback <= thresh)
+		if (nr_dirty3 <= thresh)
 			break;
 
 		task_dirties_fraction(current, &numerator, &denominator);
@@ -548,11 +548,11 @@ static void balance_dirty_pages(struct a
 		 * deltas.
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty3 = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+				     bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty3 = bdi_stat(bdi, BDI_RECLAIMABLE) +
+				     bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
 		/*
@@ -563,11 +563,10 @@ static void balance_dirty_pages(struct a
 		 * So offer a complementary way to break out of the loop when
 		 * enough bdi pages have been cleaned during our pause time.
 		 */
-		if (nr_reclaimable + nr_writeback <= dirty_thresh &&
-		    bdi_prev_dirty3 - (bdi_nr_reclaimable + bdi_nr_writeback) >
-							(long)pages_dirtied * 8)
+		if (nr_dirty3 <= dirty_thresh &&
+		    bdi_prev_dirty3 - bdi_dirty3 > (long)pages_dirtied * 8)
 			break;
-		bdi_prev_dirty3 = bdi_nr_reclaimable + bdi_nr_writeback;
+		bdi_prev_dirty3 = bdi_dirty3;
 
 
 		thresh = bdi_thresh - bdi_thresh / DIRTY_SOFT_THROTTLE_RATIO;
@@ -584,13 +583,13 @@ static void balance_dirty_pages(struct a
 			else if (thresh > bw)
 				thresh = bw;
 		}
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= thresh)
+		if (bdi_dirty3 <= thresh)
 			goto check_exceeded;
 
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
-		gap = bdi_thresh > (bdi_nr_reclaimable + bdi_nr_writeback) ?
-		      bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback) : 0;
+		gap = bdi_thresh > bdi_dirty3 ?
+		      bdi_thresh - bdi_dirty3 : 0;
 
 		bw = bdi->write_bandwidth * gap / (bdi_thresh - thresh + 1);
 
@@ -622,9 +621,8 @@ check_exceeded:
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty3 > bdi_thresh) ||
+				  (nr_dirty3 > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/17] writeback: account per-bdi accumulated written pages
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-12 15:59     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:59 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li, Shaohua

From: Jan Kara <jack@suse.cz>

Somehow it was dropped by quilt.. sorry!

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/17] writeback: account per-bdi accumulated written pages
@ 2010-09-12 15:59     ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 15:59 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Jan Kara, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li, Shaohua

From: Jan Kara <jack@suse.cz>

Somehow it was dropped by quilt.. sorry!


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 13/17] writeback: reduce per-bdi dirty threshold ramp up time
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-12 16:15     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-12 16:15 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Peter Zijlstra, Richard Kennedy, Martin J. Bligh,
	Andrew Morton, Theodore Ts'o, Dave Chinner, Jan Kara,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, Chris Mason,
	Christoph Hellwig, Li, Shaohua

On Sun, Sep 12, 2010 at 11:49:58PM +0800, Wu, Fengguang wrote:
> Reduce the dampening for the control system, yielding faster
> convergence.
> 
> Currently it converges at a snail's pace for slow devices (in order of
> minutes).  For really fast storage, the convergence speed should be fine.
> 
> It makes sense to make it reasonably fast for typical desktops.
> 
> After patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
> So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s under
> 16GB mem, which looks good.
> 
> $ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
> BdiDirtyThresh:            0 kB
> BdiDirtyThresh:       118748 kB
> BdiDirtyThresh:       214280 kB
> BdiDirtyThresh:       303868 kB
> BdiDirtyThresh:       376528 kB
> BdiDirtyThresh:       411180 kB
> BdiDirtyThresh:       448636 kB
> BdiDirtyThresh:       472260 kB
> BdiDirtyThresh:       490924 kB
> BdiDirtyThresh:       499596 kB
> BdiDirtyThresh:       507068 kB
> ...
> DirtyThresh:          530392 kB

One related observation is that the task fraction may suddenly drop:

dd-4323  [004] 21608.535781: balance_dirty_pages: bdi=8:0 weight=97% thresh=124863 gap=7071 dirtied=513 pause=44 bw=44677838
dd-4323  [004] 21608.579568: balance_dirty_pages: bdi=8:0 weight=97% thresh=124851 gap=7315 dirtied=513 pause=44 bw=46321077
dd-4323  [004] 21608.623586: balance_dirty_pages: bdi=8:0 weight=97% thresh=124852 gap=7156 dirtied=513 pause=44 bw=45199674
dd-4323  [000] 21608.667526: balance_dirty_pages: bdi=8:0 weight=97% thresh=124853 gap=7029 dirtied=513 pause=44 bw=44337926
dd-4323  [000] 21608.711259: balance_dirty_pages: bdi=8:0 weight=97% thresh=124842 gap=7146 dirtied=513 pause=44 bw=45074728
dd-4323  [000] 21608.755051: balance_dirty_pages: bdi=8:0 weight=97% thresh=124843 gap=6891 dirtied=513 pause=48 bw=43356794
dd-4323  [000] 21608.802953: balance_dirty_pages: bdi=8:0 weight=97% thresh=124834 gap=6722 dirtied=513 pause=48 bw=42211067
dd-4323  [000] 21608.850745: balance_dirty_pages: bdi=8:0 weight=97% thresh=124834 gap=6594 dirtied=513 pause=48 bw=41326916
dd-4323  [004] 21608.900524: balance_dirty_pages: bdi=8:0 weight=62% thresh=127735 gap=7575 dirtied=513 pause=40 bw=47863047
dd-4323  [004] 21608.990461: balance_dirty_pages: bdi=8:0 weight=22% thresh=131040 gap=8064 dirtied=513 pause=40 bw=49548668
dd-4323  [004] 21609.030239: balance_dirty_pages: bdi=8:0 weight=23% thresh=130971 gap=7739 dirtied=513 pause=44 bw=47469455
dd-4323  [004] 21609.074075: balance_dirty_pages: bdi=8:0 weight=23% thresh=130915 gap=7427 dirtied=513 pause=44 bw=45455503
dd-4323  [004] 21609.117927: balance_dirty_pages: bdi=8:0 weight=24% thresh=130849 gap=7105 dirtied=513 pause=48 bw=43394683
dd-4323  [004] 21609.165843: balance_dirty_pages: bdi=8:0 weight=25% thresh=130770 gap=6898 dirtied=513 pause=48 bw=42071194
dd-4323  [004] 21609.213769: balance_dirty_pages: bdi=8:0 weight=26% thresh=130719 gap=7103 dirtied=513 pause=48 bw=43366955
dd-4323  [004] 21609.261483: balance_dirty_pages: bdi=8:0 weight=26% thresh=130655 gap=6911 dirtied=513 pause=48 bw=42130514
...
dd-4323  [001] 21619.473748: balance_dirty_pages: bdi=8:0 weight=96% thresh=124200 gap=7656 dirtied=513 pause=36 bw=55354531
dd-4323  [000] 21619.762110: balance_dirty_pages: bdi=8:0 weight=96% thresh=124148 gap=7540 dirtied=513 pause=36 bw=54586428
dd-4323  [000] 21619.804259: balance_dirty_pages: bdi=8:0 weight=96% thresh=124145 gap=7281 dirtied=513 pause=36 bw=52772359
dd-4323  [004] 21619.840740: balance_dirty_pages: bdi=8:0 weight=96% thresh=124133 gap=7397 dirtied=513 pause=36 bw=53627516
dd-4323  [004] 21619.876600: balance_dirty_pages: bdi=8:0 weight=96% thresh=124133 gap=7493 dirtied=513 pause=36 bw=54331060
dd-4323  [004] 21619.912482: balance_dirty_pages: bdi=8:0 weight=97% thresh=124133 gap=7621 dirtied=513 pause=36 bw=55266828
dd-4323  [007] 21619.955231: balance_dirty_pages: bdi=8:0 weight=95% thresh=124242 gap=7410 dirtied=513 pause=36 bw=53695642
dd-4323  [007] 21619.992100: balance_dirty_pages: bdi=8:0 weight=95% thresh=124246 gap=7542 dirtied=513 pause=36 bw=54714918
dd-4323  [007] 21620.028048: balance_dirty_pages: bdi=8:0 weight=95% thresh=124232 gap=7656 dirtied=513 pause=36 bw=55612568
dd-4323  [007] 21620.067278: balance_dirty_pages: bdi=8:0 weight=95% thresh=124217 gap=7257 dirtied=513 pause=36 bw=52780982
dd-4323  [007] 21620.103783: balance_dirty_pages: bdi=8:0 weight=95% thresh=124219 gap=7387 dirtied=513 pause=36 bw=53787250
dd-4323  [003] 21620.143069: balance_dirty_pages: bdi=8:0 weight=84% thresh=125141 gap=7253 dirtied=513 pause=36 bw=53982296
dd-4323  [000] 21620.259771: balance_dirty_pages: bdi=8:0 weight=21% thresh=130291 gap=7955 dirtied=513 pause=36 bw=56894085
dd-4323  [004] 21620.295309: balance_dirty_pages: bdi=8:0 weight=22% thresh=130210 gap=7746 dirtied=513 pause=36 bw=55325100
dd-4323  [004] 21620.331046: balance_dirty_pages: bdi=8:0 weight=22% thresh=130145 gap=7425 dirtied=513 pause=36 bw=52955050
dd-4323  [004] 21620.367022: balance_dirty_pages: bdi=8:0 weight=23% thresh=130070 gap=7222 dirtied=513 pause=40 bw=51489214
dd-4323  [004] 21620.406877: balance_dirty_pages: bdi=8:0 weight=24% thresh=130004 gap=6900 dirtied=513 pause=40 bw=49086099
dd-4323  [004] 21620.446702: balance_dirty_pages: bdi=8:0 weight=25% thresh=129935 gap=6831 dirtied=513 pause=40 bw=48603064
dd-4323  [007] 21620.486673: balance_dirty_pages: bdi=8:0 weight=26% thresh=129873 gap=6641 dirtied=513 pause=44 bw=47142569
dd-4323  [007] 21620.530438: balance_dirty_pages: bdi=8:0 weight=26% thresh=129802 gap=6442 dirtied=513 pause=44 bw=45673274
dd-4323  [007] 21620.574312: balance_dirty_pages: bdi=8:0 weight=27% thresh=129743 gap=6415 dirtied=513 pause=44 bw=45466202
dd-4323  [007] 21620.618182: balance_dirty_pages: bdi=8:0 weight=28% thresh=129685 gap=6197 dirtied=513 pause=44 bw=43856286
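
A side note on reading the columns: the numbers look consistent with the
pause being computed from the just-dirtied pages over the estimated
bandwidth -- assuming bw is in bytes/second, which is an inference from
the numbers rather than from the code:

	/* apparent relation between the trace fields, units assumed */
	static unsigned long long pause_ms(unsigned long long dirtied,
					   unsigned long long bw)
	{
		return dirtied * 4096 * 1000 / bw;
	}
	/* pause_ms(513, 44677838) ~= 47, close to the logged pause=44;
	 * the difference is presumably jiffies rounding. */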

I've not looked into this yet.. need to go to bed now :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-12 20:46     ` Neil Brown
  -1 siblings, 0 replies; 98+ messages in thread
From: Neil Brown @ 2010-09-12 20:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

On Sun, 12 Sep 2010 23:49:50 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> This allows quick response to Ctrl-C etc. for impatient users.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-09-09 16:01:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
> @@ -553,6 +553,9 @@ static void balance_dirty_pages(struct a
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  		io_schedule_timeout(pause);
>  
> +		if (signal_pending(current))
> +			break;
> +

Given the patch description,  I think you might want "fatal_signal_pending()"
here ???

NeilBrown

>  check_exceeded:
>  		/*
>  		 * The bdi thresh is somehow "soft" limit derived from the
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-12 20:46     ` Neil Brown
@ 2010-09-13  1:55       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-13  1:55 UTC (permalink / raw)
  To: Neil Brown
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li, Shaohua

On Mon, Sep 13, 2010 at 04:46:54AM +0800, Neil Brown wrote:
> On Sun, 12 Sep 2010 23:49:50 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > This allows quick response to Ctrl-C etc. for impatient users.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-09-09 16:01:14.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
> > @@ -553,6 +553,9 @@ static void balance_dirty_pages(struct a
> >  		__set_current_state(TASK_INTERRUPTIBLE);
> >  		io_schedule_timeout(pause);
> >  
> > +		if (signal_pending(current))
> > +			break;
> > +
> 
> Given the patch description,  I think you might want "fatal_signal_pending()"
> here ???

__fatal_signal_pending() tests for SIGKILL only, while the signal most
often used, and the one needing the quickest response, is SIGINT..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-13  1:55       ` Wu Fengguang
@ 2010-09-13  3:21         ` Neil Brown
  -1 siblings, 0 replies; 98+ messages in thread
From: Neil Brown @ 2010-09-13  3:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li, Shaohua

On Mon, 13 Sep 2010 09:55:29 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Mon, Sep 13, 2010 at 04:46:54AM +0800, Neil Brown wrote:
> > On Sun, 12 Sep 2010 23:49:50 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > This allows quick response to Ctrl-C etc. for impatient users.
> > > 
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |    3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2010-09-09 16:01:14.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
> > > @@ -553,6 +553,9 @@ static void balance_dirty_pages(struct a
> > >  		__set_current_state(TASK_INTERRUPTIBLE);
> > >  		io_schedule_timeout(pause);
> > >  
> > > +		if (signal_pending(current))
> > > +			break;
> > > +
> > 
> > Given the patch description,  I think you might want "fatal_signal_pending()"
> > here ???
> 
> > __fatal_signal_pending() tests for SIGKILL only, while the signal most
> > often used, and the one needing the quickest response, is SIGINT..
>

I thought that at first too....  but I don't think that is the case.

In kernel/signal.c, in complete_signal, we have
  if (sig_fatal() ...)
           ....
		sigaddset(&t->pending.signal, SIGKILL);

where sig_fatal is

#define sig_fatal(t, signr) \
	(!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \
	 (t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL)


so (if I'm reading the code correctly), if a process receives a signal for
which the handler is SIG_DFL, then SIGKILL is set in the pending mask, so
__fatal_signal_pending will be true.

So fatal_signal_pending() should catch any signal that will cause the
process to exit.  I assume that is what you want...
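
For reference, the helpers in question are tiny -- roughly this in the
include/linux/sched.h of that era (quoted from memory, so treat it as a
sketch rather than the exact code):

static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}

So once complete_signal() has added SIGKILL to the pending set as quoted
above, fatal_signal_pending() becomes true.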

NeilBrown

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-13  3:21         ` Neil Brown
@ 2010-09-13  3:48           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-13  3:48 UTC (permalink / raw)
  To: Neil Brown
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li, Shaohua

On Mon, Sep 13, 2010 at 11:21:16AM +0800, Neil Brown wrote:
> On Mon, 13 Sep 2010 09:55:29 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Mon, Sep 13, 2010 at 04:46:54AM +0800, Neil Brown wrote:
> > > On Sun, 12 Sep 2010 23:49:50 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > This allows quick response to Ctrl-C etc. for impatient users.
> > > > 
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  mm/page-writeback.c |    3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > --- linux-next.orig/mm/page-writeback.c	2010-09-09 16:01:14.000000000 +0800
> > > > +++ linux-next/mm/page-writeback.c	2010-09-09 16:02:27.000000000 +0800
> > > > @@ -553,6 +553,9 @@ static void balance_dirty_pages(struct a
> > > >  		__set_current_state(TASK_INTERRUPTIBLE);
> > > >  		io_schedule_timeout(pause);
> > > >  
> > > > +		if (signal_pending(current))
> > > > +			break;
> > > > +
> > > 
> > > Given the patch description,  I think you might want "fatal_signal_pending()"
> > > here ???
> > 
> > __fatal_signal_pending() tests for SIGKILL only, while the signal most
> > often used, and the one needing the quickest response, is SIGINT..
> >
> 
> I thought that at first too....  but I don't think that is the case.
> 
> In kernel/signal.c, in complete_signal, we have
>   if (sig_fatal() ...)
>            ....
> 		sigaddset(&t->pending.signal, SIGKILL);
> 
> where sig_fatal is
> 
> #define sig_fatal(t, signr) \
> 	(!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \
> 	 (t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL)
> 
> 
> so (if I'm reading the code correctly), if a process receives a signal for
> which the handler is SIG_DFL, then SIGKILL is set in the pending mask, so
> __fatal_signal_pending will be true.
> 
> So fatal_signal_pending() should catch any signal that will cause the
> process to exit.  I assume that is what you want...

Ah yes, it does look so. Thanks for the detailed explanation!
Here is the updated patch.

Thanks,
Fengguang
---
Subject: writeback: quit throttling when fatal signal pending
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Sep 08 17:40:22 CST 2010

This allows quick response to Ctrl-C etc. for impatient users.

It mainly helps the rare cases where the bdi/global dirty limits are exceeded.
In the normal case where the limits are not exceeded, it will quit the loop anyway.

CC: Neil Brown <neilb@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2010-09-12 13:25:23.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-09-13 11:39:33.000000000 +0800
@@ -552,6 +552,9 @@ static void balance_dirty_pages(struct a
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
+		if (fatal_signal_pending(current))
+			break;
+
 check_exceeded:
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/17] writeback: IO-less balance_dirty_pages()
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-13  8:45     ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-09-13  8:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Chris Mason, Jan Kara, Peter Zijlstra,
	Jens Axboe, Andrew Morton, Theodore Ts'o, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Christoph Hellwig, Li Shaohua

On Sun, Sep 12, 2010 at 11:49:47PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the mean while, kick off the
> per-bdi flusher thread to do background writeback IO.
> 
> This patch introduces the basic framework, which will be further
> consolidated by the next patches.

Can you put all this documentation into, say,
Documentation/filesystems/writeback-throttling-design.txt?

FWIW, I'm reading this and commenting without having looked at the
code - I want to understand the design, not the implementation ;)

> RATIONALE
> =========
> 
> The current balance_dirty_pages() is rather IO inefficient.
> 
> - concurrent writeback of multiple inodes (Dave Chinner)
> 
>   If every thread doing writes and being throttled starts foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency and hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
> 
> - small nr_to_write for fast arrays
> 
>   The write_chunk used by the current balance_dirty_pages() cannot be
>   directly set to some large value (eg. 128MB) for better IO efficiency,
>   because it could lead to user-perceivable stalls of more than 1 second.
>   This limits the current balance_dirty_pages() to small, inefficient IOs.

Contrary to popular belief, I don't think nr_to_write is too small.
It's slow devices that cause problems with large chunks, not fast
arrays.

> For the above two reasons, it's much better to shift IO to the flusher
> threads and let balance_dirty_pages() just wait for enough time or progress.
> 
> Jan Kara, Dave Chinner and I explored the scheme to let
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. This is found to have two problems:
> 
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait time and jitters.
> 
> - NFS may kill a large number of unstable pages with one single COMMIT.
>   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So it's not
>   likely we can optimize away such bursty IO completions, or the
>   resulting large (and tiny) stall times they cause in IO-completion-based throttling.
> 
> So here is a pause time oriented approach, which tries to control
> 
> - the pause time in each balance_dirty_pages() invocation
> - the number of pages dirtied before calling balance_dirty_pages()
> 
> for smooth and efficient dirty throttling:
> 
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small a pause time (less than 10ms, which burns CPU power)

For fast arrays, 10ms may be too high a lower bound. e.g. at 1GB/s,
10ms = 10MB written, at 10GB/s it is 100MB, so smaller lower bounds
might be necessary for faster arrays to prevent unnecessarily long
wakeup latencies....

> CONTROL SYSTEM
> ==============
> 
> The current task_dirty_limit() adjusts bdi_thresh according to the dirty
> "weight" of the current task, which is the percent of pages recently
> dirtied by the task. If 100% pages are recently dirtied by the task, it
> will lower bdi_thresh by 1/8. If only 1% pages are dirtied by the task,
> it will return almost unmodified bdi_thresh. In this way, a heavy
> dirtier will get blocked at (bdi_thresh-bdi_thresh/8) while allowing a
> light dirtier to progress (the latter won't be blocked because R << B in
> fig.1).
> 
> Fig.1 before patch, a heavy dirtier and a light dirtier
>                                                 R
> ----------------------------------------------+-o---------------------------*--|
>                                               L A                           B  T
>   T: bdi_dirty_limit
>   L: bdi_dirty_limit - bdi_dirty_limit/8
> 
>   R: bdi_reclaimable + bdi_writeback
> 
>   A: bdi_thresh for a heavy dirtier ~= R ~= L
>   B: bdi_thresh for a light dirtier ~= T
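
For reference, the task_dirty_limit() behaviour described above boils down
to something like this (a simplified sketch of the mainline logic, not the
verbatim code):

static unsigned long task_dirty_limit(struct task_struct *tsk,
				      unsigned long bdi_thresh)
{
	long numerator, denominator;
	u64 inv = bdi_thresh >> 3;	/* at most bdi_thresh/8 */

	/* fraction of recently dirtied pages attributed to this task */
	task_dirties_fraction(tsk, &numerator, &denominator);
	inv *= numerator;
	do_div(inv, denominator);

	/* 100% dirtier: thresh - thresh/8;  1% dirtier: nearly unmodified */
	return bdi_thresh - inv;
}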

Let me get your terminology straight:

	T = throttle threshold
	L = lower throttle bound
	R = reclaimable pages

	A/B: two dirtying processes

> 
> If B is a newly started heavy dirtier, then it will slowly gain weight
> and A will lose weight.  The bdi_thresh for A and B will be approaching
> the center of region (L, T) and eventually stabilize there.
> 
> Fig.2 before patch, two heavy dirtiers converging to the same threshold
>                                                              R
> ----------------------------------------------+--------------o-*---------------|
>                                               L              A B               T
> 
> Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> way. In fig.3, a soft dirty limit region (L, A) is introduced. When R enters
> this region, the task may be throttled for T seconds on every N pages it dirtied.
> Let's call (N/T) the "throttle bandwidth". It is computed by the following formula:
> 

Now you've redefined R, L, T and A to mean completely different
things. That's kind of confusing, because you use them in similar
graphs.

>         throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L)
> where
>         L = A - A/16
> 	A = T - T/16

That means A and L are constants, so your algorithm comes down to
a first-order linear system:

	throttle_bandwidth = bdi_bandwidth * (15 - 16R/T)

that will only work in the range of 7/8T < R < 15/16T. That is,
for R < L, throttle bandwidth will be calculated to be greater than
bdi_bandwidth, and for R > A, throttle bandwidth will be negative.
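
In code form, the quoted formula is just (my reading of the description,
not code from the patch):

	/* sketch: R, A, L, T are page counts, bdi_bw is bytes/sec */
	static long long throttle_bw(long long bdi_bw, long long T, long long R)
	{
		long long A = T - T / 16;	/* task_dirty_limit */
		long long L = A - A / 16;

		/* > bdi_bw for R < L, 0 at R == A, negative for R > A,
		 * so an implementation has to clamp both ends. */
		return bdi_bw * (A - R) / (A - L);
	}

which makes the narrow working range easy to see: the result only falls
between 0 and bdi_bw while R sits between L and A.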

> So when there is only one heavy dirtier (fig.3),
> 
>         R ~= L
>         throttle_bandwidth ~= bdi_bandwidth
> 
> It's a stable balance:
> - when R > L, then throttle_bandwidth < bdi_bandwidth, so R will decrease to L
> - when R < L, then throttle_bandwidth > bdi_bandwidth, so R will increase to L

That does not imply stability. First-order control algorithms are
generally unstable - they have trouble with convergence and tend to
overshoot and oscillate - because you can't easily control the rate
of change of the controlled variable.

> Fig.3 after patch, one heavy dirtier
> 
>                                                 |
>     throttle_bandwidth ~= bdi_bandwidth  =>     o
>                                                 | o
>                                                 |   o
>                                                 |     o
>                                                 |       o
>                                                 |         o
>                                               L |           o
> ----------------------------------------------+-+-------------o----------------|
>                                                 R             A                T
>   T: bdi_dirty_limit
>   A: task_dirty_limit = bdi_dirty_limit - bdi_dirty_limit/16
>   L: task_dirty_limit - task_dirty_limit/16
> 
>   R: bdi_reclaimable + bdi_writeback ~= L
> 
> When a new cp task comes along, its weight will grow from 0 to 50%.

While the other decreases from 100% to 50%? What causes this?

> When the weight is still small, it's considered a light dirtier and it's
> allowed to dirty pages much faster than the bdi write bandwidth. In fact
> initially it won't be throttled at all when R < Lb where Lb=B-B/16 and B~=T.

I'm missing something - if the task_dirty_limit is T - T/16, then the
first task will have consumed all the dirty pages up to this
point (i.e. R ~= T - T/16). Then the second task starts, and while it is
unthrottled, it will push R well past T. That will cause the first
task to throttle hard almost immediately, and effectively get
throttled until the weight of the second task passes the "heavy"
threshold.  The first task won't get unthrottled until R passes back
down below T. That seems undesirable....

> Fig.4 after patch, an old cp + a newly started cp
> 
>                      (throttle bandwidth) =>    *
>                                                 | *
>                                                 |   *
>                                                 |     *
>                                                 |       *
>                                                 |         *
>                                                 |           *
>                                                 |             *
>                       throttle bandwidth  =>    o               *
>                                                 | o               *
>                                                 |   o               *
>                                                 |     o               *
>                                                 |       o               *
>                                                 |         o               *
>                                                 |           o               *
> ------------------------------------------------+-------------o---------------*|
>                                                 R             A               BT
> 
> So R will quickly grow large (fig.5). As the two heavy dirtiers' weights
> converge to 50%, the points A, B will go towards each other and

This assumes that the two processes are reaching equal amounts of
dirty pages in the page cache? (weight is not defined anywhere, so I
can't tell from reading the document how it is calculated)

> eventually become one in fig.5. R will stabilize around A-A/32 where
> A=B=T-T/16. throttle_bandwidth will stabilize around bdi_bandwidth/2.

Why? You haven't explained how weight affects any of the defined
variables.

> There won't be big oscillations between A and B, because as long as A
> coincides with B, their throttle_bandwidth and dirtied pages will be
> equal, A's weight will stop decreasing and B's weight will stop growing,
> so the two points won't keep moving and cross each other. So it's a
> pretty stable control system. The only problem is, it converges a bit
> slowly (except for really fast storage arrays).

Convergence should really be independent of the write speed,
otherwise we'll be forever trying to find the "best" value for
different configurations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-13  9:23     ` Johannes Weiner
  -1 siblings, 0 replies; 98+ messages in thread
From: Johannes Weiner @ 2010-09-13  9:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Jan Kara, Peter Zijlstra, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

On Sun, Sep 12, 2010 at 11:49:46PM +0800, Wu Fengguang wrote:
> The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
> This is not the behavior users expect. And it's inconsistent with
> calc_period_shift(), which uses the plain vm_dirty_ratio value.
> 
> Let's rip out the arbitrary internal bound. It may impact some very weird
> user space applications. However, we are going to dynamically size the
> dirty limits anyway, which may well break such applications, too.
> 
> At the same time, fix balance_dirty_pages() to work with the
> dirty_thresh=0 case. This allows applications to proceed when
> dirty+writeback pages are all cleaned.
> 
> And ">" fits with the name "exceeded" better than ">=" does. Neil
> thinks it is an aesthetic improvement as well as a functional one :)
> 
> CC: Jan Kara <jack@suse.cz>
> Proposed-by: Con Kolivas <kernel@kolivas.org>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Neil Brown <neilb@suse.de>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/17] mm: lower soft dirty limits on memory pressure
  2010-09-12 15:50   ` Wu Fengguang
@ 2010-09-13  9:40     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-13  9:40 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Dave Chinner, Andrew Morton, Theodore Ts'o, Jan Kara,
	Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li, Shaohua

>  		if (PageDirty(page)) {
> +
> +			if (file && scanning_global_lru(sc)) {

Oops "file" does not exist in linux-next. Could use
"page_is_file_cache(page)" instead to avoid the compile error.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-13  9:51     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-09-13  9:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Jan Kara, Peter Zijlstra, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li Shaohua

On Sun, Sep 12, 2010 at 11:49:46PM +0800, Wu Fengguang wrote:
> The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
> This is not the behavior users expect. And it's inconsistent with
> calc_period_shift(), which uses the plain vm_dirty_ratio value.
> 
> Let's rip the arbitrary internal bound. It may impact some very weird
> user space applications. However we are going to dynamicly sizing the
> dirty limits anyway, which may well break such applications, too.
> 
> At the same time, fix balance_dirty_pages() to work with the
> dirty_thresh=0 case. This allows applications to proceed when
> dirty+writeback pages are all cleaned.
> 
" fits with">
> And ">" fits the name "exceeded" better than ">=" does. Neil
> thinks it is an aesthetic improvement as well as a functional one :)
> 
> CC: Jan Kara <jack@suse.cz>
> Proposed-by: Con Kolivas <kernel@kolivas.org>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Neil Brown <neilb@suse.de>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c   |    2 +-
>  mm/page-writeback.c |   16 +++++-----------
>  2 files changed, 6 insertions(+), 12 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-08-29 08:10:30.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-08-29 08:12:08.000000000 +0800
> @@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
>  
>  	if (vm_dirty_bytes)
>  		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> -	else {
> -		int dirty_ratio;
> -
> -		dirty_ratio = vm_dirty_ratio;
> -		if (dirty_ratio < 5)
> -			dirty_ratio = 5;
> -		dirty = (dirty_ratio * available_memory) / 100;
> -	}
> +	else
> +		dirty = (vm_dirty_ratio * available_memory) / 100;
>  

What kernel is this? In a recent mainline kernel and on linux-next, this
is

dirty = (dirty_ratio * available_memory) / 100;

i.e. * instead of +. With +, the value for dirty is almost always going
to be simply 1%.

>  	if (dirty_background_bytes)
>  		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> @@ -510,7 +504,7 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_reclaimable + nr_writeback <
> +		if (nr_reclaimable + nr_writeback <=
>  				(background_thresh + dirty_thresh) / 2)
>  			break;
>  
> @@ -542,8 +536,8 @@ static void balance_dirty_pages(struct a
>  		 * the last resort safeguard.
>  		 */
>  		dirty_exceeded =
> -			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> -			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
> +			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> +			|| (nr_reclaimable + nr_writeback > dirty_thresh);
>  
>  		if (!dirty_exceeded)
>  			break;
> --- linux-next.orig/fs/fs-writeback.c	2010-08-29 08:12:51.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-08-29 08:12:53.000000000 +0800
> @@ -574,7 +574,7 @@ static inline bool over_bground_thresh(v
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
>  
>  	return (global_page_state(NR_FILE_DIRTY) +
> -		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
> +		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>  }
>  
>  /*
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
  2010-09-13  9:51     ` Mel Gorman
@ 2010-09-13  9:57       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-13  9:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, LKML, Jan Kara, Peter Zijlstra, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li, Shaohua

On Mon, Sep 13, 2010 at 05:51:30PM +0800, Mel Gorman wrote:
> On Sun, Sep 12, 2010 at 11:49:46PM +0800, Wu Fengguang wrote:
> > The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
> > This is not user-expected behavior. And it's inconsistent with
> > calc_period_shift(), which uses the plain vm_dirty_ratio value.
> > 
> > Let's rip out the arbitrary internal bound. It may impact some very weird
> > user space applications. However, we are going to dynamically size the
> > dirty limits anyway, which may well break such applications, too.
> > 
> > At the same time, fix balance_dirty_pages() to work with the
> > dirty_thresh=0 case. This allows applications to proceed when
> > dirty+writeback pages are all cleaned.
> > 
" fits with">
> > And ">" fits the name "exceeded" better than ">=" does. Neil
> > thinks it is an aesthetic improvement as well as a functional one :)
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Proposed-by: Con Kolivas <kernel@kolivas.org>
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Neil Brown <neilb@suse.de>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c   |    2 +-
> >  mm/page-writeback.c |   16 +++++-----------
> >  2 files changed, 6 insertions(+), 12 deletions(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-08-29 08:10:30.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-08-29 08:12:08.000000000 +0800
> > @@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
> >  
> >  	if (vm_dirty_bytes)
> >  		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > -	else {
> > -		int dirty_ratio;
> > -
> > -		dirty_ratio = vm_dirty_ratio;
> > -		if (dirty_ratio < 5)
> > -			dirty_ratio = 5;
> > -		dirty = (dirty_ratio * available_memory) / 100;
> > -	}
> > +	else
> > +		dirty = (vm_dirty_ratio * available_memory) / 100;
> >  
> 
> What kernel is this? In a recent mainline kernel and on linux-next, this
> is

It applies to linux-next 20100903.

> dirty = (dirty_ratio * available_memory) / 100;
> 
> i.e. * instead of +. With +, the value for dirty is almost always going
> to be simply 1%.

Where's the "+" come from?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio
  2010-09-13  9:57       ` Wu Fengguang
@ 2010-09-13 10:10         ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-09-13 10:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Jan Kara, Peter Zijlstra, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Rik van Riel, KOSAKI Motohiro,
	Chris Mason, Christoph Hellwig, Li, Shaohua

> > 
> > i.e. * instead of +. With +, the value for dirty is almost always going
> > to be simply 1%.
> 
> Where's the "+" come from?
> 

This is embarrassing. I was reading mail in a small font that had reduced all *
to look like +. Ignore the question.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/17] writeback: IO-less balance_dirty_pages()
  2010-09-13  8:45     ` Dave Chinner
@ 2010-09-13 11:38       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-13 11:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, LKML, Chris Mason, Jan Kara, Peter Zijlstra,
	Jens Axboe, Andrew Morton, Theodore Ts'o, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Christoph Hellwig, Li, Shaohua

On Mon, Sep 13, 2010 at 04:45:34PM +0800, Dave Chinner wrote:
> On Sun, Sep 12, 2010 at 11:49:47PM +0800, Wu Fengguang wrote:
> > As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> > inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> > time to throttle the dirtying task. In the mean while, kick off the
> > per-bdi flusher thread to do background writeback IO.
> > 
> > This patch introduces the basic framework, which will be further
> > consolidated by the next patches.
> 
> Can you put all this documentation into, say,
> Documentation/filesystems/writeback-throttling-design.txt?

OK.

> FWIW, I'm reading this and commenting without having looked at the
> code - I want to understand the design, not the implementation ;)

To get a further understanding of the dynamics, you are advised to run
it on fast storage and check out the traces as shown in patch 07 ;)

> > RATIONALE
> > =========
> > 
> > The current balance_dirty_pages() is rather IO inefficient.
> > 
> > - concurrent writeback of multiple inodes (Dave Chinner)
> > 
> >   If every thread doing writes and being throttled start foreground
> >   writeback, it leads to N IO submitters from at least N different
> >   inodes at the same time, end up with N different sets of IO being
> >   issued with potentially zero locality to each other, resulting in
> >   much lower elevator sort/merge efficiency and hence we seek the disk
> >   all over the place to service the different sets of IO.
> >   OTOH, if there is only one submission thread, it doesn't jump between
> >   inodes in the same way when congestion clears - it keeps writing to
> >   the same inode, resulting in large related chunks of sequential IOs
> >   being issued to the disk. This is more efficient than the above
> >   foreground writeback because the elevator works better and the disk
> >   seeks less.
> > 
> > - small nr_to_write for fast arrays
> > 
> >   The write_chunk used by current balance_dirty_pages() cannot be
> >   directly set to some large value (eg. 128MB) for better IO efficiency.
> >   Because it could lead to more than 1 second user perceivable stalls.
> >   This limits current balance_dirty_pages() to small inefficient IOs.
> 
> Contrary to popular belief, I don't think nr_to_write is too small.
> It's slow devices that cause problems with large chunks, not fast
> arrays.

Then we have another merit: "shorter stall time for slow devices" :)
This algorithm is able to adapt to a reasonable pause time for both fast
and slow devices.

> > For the above two reasons, it's much better to shift IO to the flusher
> > threads and let balance_dirty_pages() just wait for enough time or progress.
> > 
> > Jan Kara, Dave Chinner and me explored the scheme to let
> > balance_dirty_pages() wait for enough writeback IO completions to
> > safeguard the dirty limit. This is found to have two problems:
> > 
> > - in large NUMA systems, the per-cpu counters may have big accounting
> >   errors, leading to big throttle wait time and jitters.
> > 
> > - NFS may kill large amount of unstable pages with one single COMMIT.
> >   Because NFS server serves COMMIT with expensive fsync() IOs, it is
> >   desirable to delay and reduce the number of COMMITs. So it's not
> >   likely to optimize away such kind of bursty IO completions, and the
> >   resulted large (and tiny) stall times in IO completion based throttling.
> > 
> > So here is a pause time oriented approach, which tries to control
> > 
> > - the pause time in each balance_dirty_pages() invocations
> > - the number of pages dirtied before calling balance_dirty_pages()
> > 
> > for smooth and efficient dirty throttling:
> > 
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than  10ms, which burns CPU power)
> 
> For fast arrays, 10ms may be too high a lower bound. e.g. at 1GB/s,
> 10ms = 10MB written, at 10GB/s it is 100MB, so lower bounds for
> faster arrays might be necessary to prevent unnecessarily long
> wakeup latencies....

No problem. Then the user may need to enable HZ=1000 to get a 1ms lower
bound.

I chose the 10ms boundary because I believe it's good to keep at least 1
second's worth of dirty/writeback pages for good IO performance. (Does
that match your experience?)

That means 10GB for a 10GB/s device. Given such a big pool of dirty
pages, a 10ms pause time (eg. a 100MB drop of dirty pages) seems pretty
small.

> > CONTROL SYSTEM
> > ==============
> > 
> > The current task_dirty_limit() adjusts bdi_thresh according to the dirty
> > "weight" of the current task, which is the percent of pages recently
> > dirtied by the task. If 100% pages are recently dirtied by the task, it
> > will lower bdi_thresh by 1/8. If only 1% pages are dirtied by the task,
> > it will return almost unmodified bdi_thresh. In this way, a heavy
> > dirtier will get blocked at (bdi_thresh-bdi_thresh/8) while allowing a
> > light dirtier to progress (the latter won't be blocked because R << B in
> > fig.1).
> > 
> > Fig.1 before patch, a heavy dirtier and a light dirtier
> >                                                 R
> > ----------------------------------------------+-o---------------------------*--|
> >                                               L A                           B  T
> >   T: bdi_dirty_limit
> >   L: bdi_dirty_limit - bdi_dirty_limit/8
> > 
> >   R: bdi_reclaimable + bdi_writeback
> > 
> >   A: bdi_thresh for a heavy dirtier ~= R ~= L
> >   B: bdi_thresh for a light dirtier ~= T
> 
> Let me get your terminology straight:
> 
> 	T = throttle threshold

It's the value returned from bdi_dirty_limit() before calling
task_dirty_limit().

> 	L = lower throttle bound

Right.

> 	R = reclaimable pages

R = dirty + writeback + unstable pages currently in the bdi

> 	A/B: two dirtying processes

A/B means processes in some contexts and the value returned from
task_dirty_limit() in other contexts.

> > 
> > If B is a newly started heavy dirtier, then it will slowly gain weight
> > and A will lose weight.  The bdi_thresh for A and B will be approaching
> > the center of region (L, T) and eventually stabilize there.
> > 
> > Fig.2 before patch, two heavy dirtiers converging to the same threshold
> >                                                              R
> > ----------------------------------------------+--------------o-*---------------|
> >                                               L              A B               T
> > 
> > Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> > way. In fig.3, a soft dirty limit region (L, A) is introduced. When R enters
> > this region, the task may be throttled for T seconds on every N pages it dirtied.
> > Let's call (N/T) the "throttle bandwidth". It is computed by the following formula:
> > 
> 
> Now you've redefined R, L, T and A to mean completely different
> things. That's kind of confusing, because you use them in similar
> graphs

The meanings are slightly redefined _after_ this patch. But for the above
graph, the meanings are still the same, except that A/B/R float
around as time goes by.

> >         throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L)
> > where
> >         L = A - A/16
> > 	A = T - T/16

        A = T - weight*(T/16)

where weight is 0 for a very light dirtier and 1 for the single heavy
dirtier (one that consumes 100% of the bdi write bandwidth)

Sorry..

> That means A and L are constants, so your algorithm comes down to
> a first-order linear system:
> 
> 	throttle_bandwidth = bdi_bandwidth * (15 - 16R/T)
> 
> that will only work in the range of 7/8T < R < 15/16T. That is,
> for R < L, throttle bandwidth will be calculated to be greater than
> bdi_bandwidth,

When R < L, we don't throttle it at all.

> and for R > A, throttle bandwidth will be negative.

When R > A, the code will detect the negative value and choose to pause
for 200ms (the upper pause boundary), then loop over again.
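
Putting the three cases together, the computation is roughly the following
(a minimal sketch with made-up names and a simplified 0..1024 fixed-point
weight; this illustrates the formula above, not the code in the patch):

static unsigned long task_throttle_bandwidth(unsigned long T, unsigned long R,
					     unsigned long weight,
					     unsigned long bdi_bandwidth)
{
	unsigned long A = T - weight * (T / 16) / 1024;	/* task dirty limit */
	unsigned long L = A - A / 16;	/* bottom of the soft throttle region */

	if (R < L)
		return ~0UL;	/* below the region: not throttled at all */
	if (R >= A)
		return 0;	/* at/above the region: take the 200ms max pause and loop */

	return bdi_bandwidth * (A - R) / (A - L);
}

The caller would then turn the returned bandwidth into a pause time for the
pages it has just dirtied, clamped to the 10ms..200ms window discussed
elsewhere in this thread.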

> > So when there is only one heavy dirtier (fig.3),
> > 
> >         R ~= L
> >         throttle_bandwidth ~= bdi_bandwidth
> > 
> > It's a stable balance:
> > - when R > L, then throttle_bandwidth < bdi_bandwidth, so R will decrease to L
> > - when R < L, then throttle_bandwidth > bdi_bandwidth, so R will increase to L
> 
> That does not imply stability. First-order control algorithms are
> generally unstable - they have trouble with convergence and tend to
> overshoot and oscillate - because you can't easily control the rate
> of change of the controlled variable.

Sure, there are always oscillations. And they will be bounded. When
there is 1 heavy dirtier, the error bound will be nr_dirtied_pause
and/or (pause time * bdi bandwidth). When there are 2 equal-speed
dirtiers, the max error is 2 * (pause time * bdi bandwidth/2), which
is still the same (given the same pause time).

The error is unavoidable anyway, as long as you throttle with a pause of
any kind, either by waiting for enough IO completions or simply by
waiting for some calculated pause time. The core feature offered by this
patch is that it allows you to control the error by explicitly
controlling the pause time :)
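
For a concrete illustration with made-up numbers: at a bdi write bandwidth
of 100MB/s and a 10ms pause, a single heavy dirtier can overshoot by about
100MB/s * 10ms = 1MB per balance_dirty_pages() invocation; with two
equal-speed dirtiers the bound is 2 * (10ms * 50MB/s) = 1MB as well. So
the error is set by the pause time, not by the number of dirtiers.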

> > Fig.3 after patch, one heavy dirtier
> > 
> >                                                 |
> >     throttle_bandwidth ~= bdi_bandwidth  =>     o
> >                                                 | o
> >                                                 |   o
> >                                                 |     o
> >                                                 |       o
> >                                                 |         o
> >                                               L |           o
> > ----------------------------------------------+-+-------------o----------------|
> >                                                 R             A                T
> >   T: bdi_dirty_limit
> >   A: task_dirty_limit = bdi_dirty_limit - bdi_dirty_limit/16

Here A = T - weight*T/16 = T - 1*T/16 = T*15/16

> >   L: task_dirty_limit - task_dirty_limit/16
> > 
> >   R: bdi_reclaimable + bdi_writeback ~= L
> > 
> > When there comes a new cp task, its weight will grow from 0 to 50%.
> 
> While the other decreases from 100% to 50%? What causes this?

The weight will be updated independently by task_dirty_inc() at
set_page_dirty() time.

> > When the weight is still small, it's considered a light dirtier and it's
> > allowed to dirty pages much faster than the bdi write bandwidth. In fact
> > initially it won't be throttled at all when R < Lb where Lb=B-B/16 and B~=T.
> 
> I'm missing something - if the task_dirty_limit is T/16, then the

task_dirty_limit is A = T - weight*(T/16). T/16 is the max possible
region size.

> the first task will have consumed all the dirty pages up to this
> point (i.e. R ~= T/16). The then second task starts, and while it is

With only 1 heavy dirtier, R = L = A - A/16 where A = T - 1*T/16
(weight=1).

> unthrottled, it will push R well past T. That will cause the first

When the second task B starts, its weight=0, so B=T-weight*T/16=T.
It will get throttled as soon as R (which is independent of tasks)
enters Lb (ie. the L value for B) = B - B/16; at that point it is
gently throttled at exactly the bdi bandwidth. So R will grow, because
now the dirty bandwidth of A plus B exceeds the bdi write bandwidth.
As R grows, both A and B will get throttled more and more, until the
new balance point is reached.

> task to throttle hard almost immediately, and effectively get
> throttled until the weight of the second task passes the "heavy"
> threshold.  The first task won't get unthrottled until R passes back
> down below T. That seems undesirable....
> 
> > Fig.4 after patch, an old cp + a newly started cp
> > 
> >                      (throttle bandwidth) =>    *
> >                                                 | *
> >                                                 |   *
> >                                                 |     *
> >                                                 |       *
> >                                                 |         *
> >                                                 |           *
> >                                                 |             *
> >                       throttle bandwidth  =>    o               *
> >                                                 | o               *
> >                                                 |   o               *
> >                                                 |     o               *
> >                                                 |       o               *
> >                                                 |         o               *
> >                                                 |           o               *
> > ------------------------------------------------+-------------o---------------*|
> >                                                 R             A               BT
> > 
> > So R will quickly grow large (fig.5). As the two heavy dirtiers' weight
> > converge to 50%, the points A, B will go towards each other and
> 
> This assumes that the two processes are reaching equal amounts of
> dirty pages in the page cache? (weight is not defined anywhere, so I
> can't tell from reading the document how it is calculated)

R = bdi dirty + writeback + unstable pages, which is independent of
(and has the same value for) all tasks. There is only one vertical
line for R, and many diagonal lines, one for each task.

> > eventually become one in fig.5. R will stabilize around A-A/32 where
> > A=B=T-T/16. throttle_bandwidth will stabilize around bdi_bandwidth/2.
> 
> Why? You haven't explained how weight affects any of the defined
> variables

Sorry. The corrected sentence is

        R will stabilize around A-A/32 where A=B=T-0.5*T/16=T-T/32.
        throttle_bandwidth will stabilize around bdi_bandwidth/2.
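
(One way to see the A-A/32 figure: at the stable point each task must be
writing at bdi_bandwidth/2, so throttle_bandwidth = bdi_bandwidth * (A - R)
/ (A - L) requires (A - R)/(A - L) = 1/2; with L = A - A/16 that gives
A - R = A/32, i.e. R = A - A/32.)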

> > There won't be big oscillations between A and B, because as long as A
> > coincides with B, their throttle_bandwidth and dirtied pages will be
> > equal, A's weight will stop decreasing and B's weight will stop growing,
> > so the two points won't keep moving and cross each other. So it's a
> > pretty stable control system. The only problem is, it converges a bit
> > slow (except for really fast storage array).
> 
> Convergence should really be independent of the write speed,
> otherwise we'll be forever trying to find the "best" value for
> different configurations.

The convergence speed mainly affects R, and much less the write
speeds of A/B. So this is not a big problem.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-13  3:48           ` Wu Fengguang
@ 2010-09-14  8:23             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 98+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  8:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Neil Brown, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Jan Kara, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Chris Mason, Christoph Hellwig, Li,
	Shaohua

> Subject: writeback: quit throttling when fatal signal pending
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Wed Sep 08 17:40:22 CST 2010
> 
> This allows quick response to Ctrl-C etc. for impatient users.
> 
> It mainly helps the rare bdi/global dirty exceeded cases.
> In the normal case of not exceeded, it will quit the loop anyway. 
> 
> CC: Neil Brown <neilb@suse.de>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-09-12 13:25:23.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-09-13 11:39:33.000000000 +0800
> @@ -552,6 +552,9 @@ static void balance_dirty_pages(struct a
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  		io_schedule_timeout(pause);
>  
> +		if (fatal_signal_pending(current))
> +			break;
> +
>  check_exceeded:
>  		/*
>  		 * The bdi thresh is somehow "soft" limit derived from the

I think we need to change the callers (e.g. generic_perform_write) too.
Otherwise, a heavy write + SIGKILL combination can easily exceed the dirty
limit. That means we could see a strange OOM.
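
Something along these lines in the caller, just to sketch the idea (an
illustration only, not a tested patch):

	/*
	 * Stop the copy loop in generic_perform_write() once a fatal signal
	 * is pending, so a SIGKILLed writer stops dirtying new pages instead
	 * of pushing past the dirty limit.
	 */
	do {
		/* ... write_begin / copy from user / write_end as today ... */

		balance_dirty_pages_ratelimited(mapping);

		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}
	} while (iov_iter_count(i));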




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/17] writeback: account per-bdi accumulated written pages
  2010-09-12 15:49   ` Wu Fengguang
@ 2010-09-14  8:32     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 98+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  8:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, linux-mm, LKML, Jan Kara, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Chris Mason, Christoph Hellwig, Li Shaohua

> Introduce the BDI_WRITTEN counter. It will be used for estimating the
> bdi's write bandwidth.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

I like this patch :-)
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-14  8:23             ` KOSAKI Motohiro
@ 2010-09-14  8:33               ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-14  8:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Neil Brown, linux-mm, LKML, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Chris Mason, Christoph Hellwig, Li, Shaohua

On Tue, Sep 14, 2010 at 04:23:56PM +0800, KOSAKI Motohiro wrote:
> > Subject: writeback: quit throttling when fatal signal pending
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > Date: Wed Sep 08 17:40:22 CST 2010
> > 
> > This allows quick response to Ctrl-C etc. for impatient users.
> > 
> > It mainly helps the rare bdi/global dirty exceeded cases.
> > In the normal case of not exceeded, it will quit the loop anyway. 
> > 
> > CC: Neil Brown <neilb@suse.de>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-09-12 13:25:23.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-09-13 11:39:33.000000000 +0800
> > @@ -552,6 +552,9 @@ static void balance_dirty_pages(struct a
> >  		__set_current_state(TASK_INTERRUPTIBLE);
> >  		io_schedule_timeout(pause);
> >  
> > +		if (fatal_signal_pending(current))
> > +			break;
> > +
> >  check_exceeded:
> >  		/*
> >  		 * The bdi thresh is somehow "soft" limit derived from the
> 
> I think we need to change callers (e.g. generic_perform_write) too.
> Otherwise, plenty write + SIGKILL combination easily exceed dirty limit.
> It mean we can see strange OOM.

If it's dangerous, we can do without this patch. Users can still
get a quick response in the normal case, after all.

However, I suspect the process is guaranteed to exit on
fatal_signal_pending, so it won't dirty more pages :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-14  8:33               ` Wu Fengguang
@ 2010-09-14  8:44                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 98+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  8:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Neil Brown, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Jan Kara, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Chris Mason, Christoph Hellwig, Li,
	Shaohua

> On Tue, Sep 14, 2010 at 04:23:56PM +0800, KOSAKI Motohiro wrote:
> > > Subject: writeback: quit throttling when fatal signal pending
> > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > Date: Wed Sep 08 17:40:22 CST 2010
> > > 
> > > This allows quick response to Ctrl-C etc. for impatient users.
> > > 
> > > It mainly helps the rare bdi/global dirty exceeded cases.
> > > In the normal case of not exceeded, it will quit the loop anyway. 
> > > 
> > > CC: Neil Brown <neilb@suse.de>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |    3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2010-09-12 13:25:23.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2010-09-13 11:39:33.000000000 +0800
> > > @@ -552,6 +552,9 @@ static void balance_dirty_pages(struct a
> > >  		__set_current_state(TASK_INTERRUPTIBLE);
> > >  		io_schedule_timeout(pause);
> > >  
> > > +		if (fatal_signal_pending(current))
> > > +			break;
> > > +
> > >  check_exceeded:
> > >  		/*
> > >  		 * The bdi thresh is somehow "soft" limit derived from the
> > 
> > I think we need to change callers (e.g. generic_perform_write) too.
> > Otherwise, plenty write + SIGKILL combination easily exceed dirty limit.
> > It mean we can see strange OOM.
> 
> If it's dangerous, we can do without this patch.  

How?


> The users can still
> get quick response in normal case after all.
> 
> However, I suspect the process is guaranteed to exit on
> fatal_signal_pending, so it won't dirty more pages :)

Process exit is delayed until the syscall exits. So, we need to exit the
write syscall manually if necessary.
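
A minimal sketch of the point (illustrative only, not actual kernel code):
a loop in kernel space has to poll for the fatal signal itself, because the
pending SIGKILL is only acted on when the task heads back toward user space.

	while (more_work) {
		if (fatal_signal_pending(current))
			break;		/* make the syscall return early */
		do_some_work();
	}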

Am I missing anything?



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-14  8:44                 ` KOSAKI Motohiro
@ 2010-09-14  9:17                   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-09-14  9:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Neil Brown, linux-mm, LKML, Andrew Morton, Theodore Ts'o,
	Dave Chinner, Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Chris Mason, Christoph Hellwig, Li, Shaohua

On Tue, Sep 14, 2010 at 04:44:37PM +0800, KOSAKI Motohiro wrote:
> > On Tue, Sep 14, 2010 at 04:23:56PM +0800, KOSAKI Motohiro wrote:
> > > > Subject: writeback: quit throttling when fatal signal pending
> > > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > > Date: Wed Sep 08 17:40:22 CST 2010
> > > > 
> > > > This allows quick response to Ctrl-C etc. for impatient users.
> > > > 
> > > > It mainly helps the rare bdi/global dirty exceeded cases.
> > > > In the normal case of not exceeded, it will quit the loop anyway. 
> > > > 
> > > > CC: Neil Brown <neilb@suse.de>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  mm/page-writeback.c |    3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > --- linux-next.orig/mm/page-writeback.c	2010-09-12 13:25:23.000000000 +0800
> > > > +++ linux-next/mm/page-writeback.c	2010-09-13 11:39:33.000000000 +0800
> > > > @@ -552,6 +552,9 @@ static void balance_dirty_pages(struct a
> > > >  		__set_current_state(TASK_INTERRUPTIBLE);
> > > >  		io_schedule_timeout(pause);
> > > >  
> > > > +		if (fatal_signal_pending(current))
> > > > +			break;
> > > > +
> > > >  check_exceeded:
> > > >  		/*
> > > >  		 * The bdi thresh is somehow "soft" limit derived from the
> > > 
> > > I think we need to change callers (e.g. generic_perform_write) too.
> > > Otherwise, plenty write + SIGKILL combination easily exceed dirty limit.
> > > It mean we can see strange OOM.
> > 
> > If it's dangerous, we can do without this patch.  
> 
> How?

As you described.

> > The users can still
> > get quick response in normal case after all.
> > 
> > However, I suspect the process is guaranteed to exit on
> > fatal_signal_pending, so it won't dirty more pages :)
> 
> Process exiting is delayed until syscall exiting. So, we exit write syscall
> manually if necessary.

Got it, you mean this fix. It looks good. I didn't add "status = -EINTR"
in the patch because the bottom line "written ? written : status" will
always select the non-zero written count.

diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..f6d2740 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,7 +2304,8 @@ again:
 		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
-
+		if (fatal_signal_pending(current))
+			break;
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
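
(To make the return-value semantics concrete -- a standalone sketch, not the
actual kernel helper: a partial write masks the error status, so only a write
that copied nothing would propagate -EINTR, and the pending SIGKILL is acted
on as soon as the syscall returns.)

	/* illustrative only */
	static ssize_t write_return(ssize_t written, ssize_t status)
	{
		return written ? written : status;
	}
	/* write_return(4096, -EINTR) == 4096; write_return(0, -EINTR) == -EINTR */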

Thanks,
Fengguang

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 05/17] writeback: quit throttling when signal pending
  2010-09-14  9:17                   ` Wu Fengguang
@ 2010-09-14  9:25                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 98+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  9:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Neil Brown, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Dave Chinner, Jan Kara, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Chris Mason, Christoph Hellwig, Li,
	Shaohua

> > > However, I suspect the process is guaranteed to exit on
> > > fatal_signal_pending, so it won't dirty more pages :)
> > 
> > Process exiting is delayed until syscall exiting. So, we exit write syscall
> > manually if necessary.
> 
> Got it, you mean this fix. It looks good. I didn't add "status =
> -EINTR" in the patch because the bottom line "written ? : status" will
> always select the non-zero written.
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3d4df44..f6d2740 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,7 +2304,8 @@ again:
>  		written += copied;
>  
>  		balance_dirty_pages_ratelimited(mapping);
> -
> +		if (fatal_signal_pending(current))
> +			break;
>  	} while (iov_iter_count(i));

Looks good. However, other callers also need to be updated.




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-09-12 15:49 ` Wu Fengguang
@ 2010-10-12 14:17   ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-10-12 14:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

Wu, what's the state of this series?  It looks like we'll need it
rather sooner than later - getting at least the preparations in
ASAP would be really helpful.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-12 14:17   ` Christoph Hellwig
@ 2010-10-13  3:07     ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-10-13  3:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, linux-mm, LKML, Andrew Morton, Theodore Ts'o,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li Shaohua

On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> Wu, what's the state of this series?  It looks like we'll need it
> rather sooner than later - try to get at least the preparations in
> ASAP would be really helpful.

Not ready in its current form. This load (creating millions of 1-byte
files in parallel):

$ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> -d /mnt/scratch/0 -d /mnt/scratch/1 \
> -d /mnt/scratch/2 -d /mnt/scratch/3 \
> -d /mnt/scratch/4 -d /mnt/scratch/5 \
> -d /mnt/scratch/6 -d /mnt/scratch/7

Locks up all the fs_mark processes spinning in traces like the
following and no further progress is made when the inode cache
fills memory.

[ 2601.452017] fs_mark       R  running task        0  2303   2235 0x00000008
[ 2601.452017]  ffff8801188f7878 ffffffff8103e2c9 ffff8801188f78a8 0000000000000000
[ 2601.452017]  0000000000000002 ffff8801129e21c0 ffff880002fd44c0 0000000000000000
[ 2601.452017]  ffff8801188f78b8 ffffffff810a9a08 ffff8801188f78e8 ffffffff810a98e5
[ 2601.452017] Call Trace:
[ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
[ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
[ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
[ 2601.452017]  [<ffffffff810b9e00>] ? __lock_acquire+0x330/0x14d0
[ 2601.452017]  [<ffffffff810a9a94>] ? local_clock+0x34/0x80
[ 2601.452017]  [<ffffffff81061cc8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2601.452017]  [<ffffffff81061cc8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
[ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
[ 2601.452017]  [<ffffffff810bb054>] ? lock_acquire+0xb4/0x140
[ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
[ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
[ 2601.452017]  [<ffffffff81698ea2>] ? prop_get_global+0x32/0x50
[ 2601.452017]  [<ffffffff81699230>] ? prop_fraction_percpu+0x30/0xa0
[ 2601.452017]  [<ffffffff8111af3b>] ? bdi_dirty_limit+0x9b/0xe0
[ 2601.452017]  [<ffffffff8111bbd8>] ? balance_dirty_pages_ratelimited_nr+0x178/0x580
[ 2601.452017]  [<ffffffff81ad440b>] ? _raw_spin_unlock+0x2b/0x40
[ 2601.452017]  [<ffffffff8117ccd5>] ? __mark_inode_dirty+0xc5/0x230
[ 2601.452017]  [<ffffffff811114d5>] ? iov_iter_copy_from_user_atomic+0x95/0x170
[ 2601.452017]  [<ffffffff811118fc>] ? generic_file_buffered_write+0x1cc/0x270
[ 2601.452017]  [<ffffffff81492f2f>] ? xfs_file_aio_write+0x79f/0xaf0
[ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
[ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
[ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
[ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
[ 2601.452017]  [<ffffffff81157cca>] ? do_sync_write+0xda/0x120
[ 2601.452017]  [<ffffffff8112e20c>] ? might_fault+0x5c/0xb0
[ 2601.452017]  [<ffffffff81669f7f>] ? security_file_permission+0x1f/0x80
[ 2601.452017]  [<ffffffff81157fb8>] ? vfs_write+0xc8/0x180
[ 2601.452017]  [<ffffffff81158904>] ? sys_write+0x54/0x90
[ 2601.452017]  [<ffffffff81037072>] ? system_call_fastpath+0x16/0x1b

This is on an 8p/4GB RAM VM.

FWIW, this one test now has a proven record of exposing writeback,
VM and filesystem regressions, so I'd suggest that anyone doing any
sort of work that affects writeback adds it to their test matrix....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-13  3:07     ` Dave Chinner
@ 2010-10-13  3:23       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-10-13  3:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > Wu, what's the state of this series?  It looks like we'll need it
> > rather sooner than later - try to get at least the preparations in
> > ASAP would be really helpful.
> 
> Not ready in it's current form. This load (creating millions of 1
> byte files in parallel):
> 
> $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > -d /mnt/scratch/6 -d /mnt/scratch/7
> 
> Locks up all the fs_mark processes spinning in traces like the
> following and no further progress is made when the inode cache
> fills memory.
 
Dave, thanks for the testing! I'll try to reproduce it and check
what's going on.

Thanks,
Fengguang

> [ 2601.452017] fs_mark       R  running task        0  2303   2235 0x00000008
> [ 2601.452017]  ffff8801188f7878 ffffffff8103e2c9 ffff8801188f78a8 0000000000000000
> [ 2601.452017]  0000000000000002 ffff8801129e21c0 ffff880002fd44c0 0000000000000000
> [ 2601.452017]  ffff8801188f78b8 ffffffff810a9a08 ffff8801188f78e8 ffffffff810a98e5
> [ 2601.452017] Call Trace:
> [ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
> [ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
> [ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
> [ 2601.452017]  [<ffffffff810b9e00>] ? __lock_acquire+0x330/0x14d0
> [ 2601.452017]  [<ffffffff810a9a94>] ? local_clock+0x34/0x80
> [ 2601.452017]  [<ffffffff81061cc8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 2601.452017]  [<ffffffff81061cc8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
> [ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
> [ 2601.452017]  [<ffffffff810bb054>] ? lock_acquire+0xb4/0x140
> [ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
> [ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
> [ 2601.452017]  [<ffffffff81698ea2>] ? prop_get_global+0x32/0x50
> [ 2601.452017]  [<ffffffff81699230>] ? prop_fraction_percpu+0x30/0xa0
> [ 2601.452017]  [<ffffffff8111af3b>] ? bdi_dirty_limit+0x9b/0xe0
> [ 2601.452017]  [<ffffffff8111bbd8>] ? balance_dirty_pages_ratelimited_nr+0x178/0x580
> [ 2601.452017]  [<ffffffff81ad440b>] ? _raw_spin_unlock+0x2b/0x40
> [ 2601.452017]  [<ffffffff8117ccd5>] ? __mark_inode_dirty+0xc5/0x230
> [ 2601.452017]  [<ffffffff811114d5>] ? iov_iter_copy_from_user_atomic+0x95/0x170
> [ 2601.452017]  [<ffffffff811118fc>] ? generic_file_buffered_write+0x1cc/0x270
> [ 2601.452017]  [<ffffffff81492f2f>] ? xfs_file_aio_write+0x79f/0xaf0
> [ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
> [ 2601.452017]  [<ffffffff81060edc>] ? kvm_clock_read+0x1c/0x20
> [ 2601.452017]  [<ffffffff8103e2c9>] ? sched_clock+0x9/0x10
> [ 2601.452017]  [<ffffffff810a98e5>] ? sched_clock_local+0x25/0x90
> [ 2601.452017]  [<ffffffff81157cca>] ? do_sync_write+0xda/0x120
> [ 2601.452017]  [<ffffffff8112e20c>] ? might_fault+0x5c/0xb0
> [ 2601.452017]  [<ffffffff81669f7f>] ? security_file_permission+0x1f/0x80
> [ 2601.452017]  [<ffffffff81157fb8>] ? vfs_write+0xc8/0x180
> [ 2601.452017]  [<ffffffff81158904>] ? sys_write+0x54/0x90
> [ 2601.452017]  [<ffffffff81037072>] ? system_call_fastpath+0x16/0x1b
> 
> This is on an 8p/4GB RAM VM.
> 
> FWIW, this one test now has a proven record of exposing writeback,
> VM and filesystem regressions, so I'd suggest that anyone doing any
> sort of work that affects writeback adds it to their test matrix....

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-13  3:07     ` Dave Chinner
@ 2010-10-13  8:26       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-10-13  8:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > Wu, what's the state of this series?  It looks like we'll need it
> > rather sooner than later - try to get at least the preparations in
> > ASAP would be really helpful.
> 
> Not ready in it's current form. This load (creating millions of 1
> byte files in parallel):
> 
> $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > -d /mnt/scratch/6 -d /mnt/scratch/7
> 
> Locks up all the fs_mark processes spinning in traces like the
> following and no further progress is made when the inode cache
> fills memory.

I reproduced the problem on a 6G/8p 2-socket 11-disk box.

The root cause is that pageout() is somehow called with a low scan priority,
which deserves more investigation.

The direct cause is that balance_dirty_pages() then keeps nr_dirty too low,
which can easily be improved by not pushing the soft dirty limit down to
less than one second's worth of dirty pages.
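
A minimal sketch of that floor (illustrative names, not the actual functions
in this series), assuming the write bandwidth is estimated in pages per
second as in the bandwidth-estimation patches:

	/* illustrative sketch, not the real patch */
	static unsigned long soft_dirty_floor(unsigned long soft_thresh,
					      unsigned long write_bw) /* pages/s */
	{
		unsigned long floor = write_bw;	/* ~1 second of writeback */

		return soft_thresh < floor ? floor : soft_thresh;
	}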

My test box has two nodes, and their memory usage is rather unbalanced:
(Dave, maybe you have a NUMA setup too?)

root@wfg-ne02 ~# cat /sys/devices/system/node/node0/meminfo
root@wfg-ne02 ~# cat /sys/devices/system/node/node1/meminfo

                          Node 0         Node 1
        ------------------------------------------
        MemTotal:        3133760 kB     3145728 kB
==>     MemFree:          453016 kB     2283884 kB
==>     MemUsed:         2680744 kB      861844 kB
        Active:           436436 kB        9744 kB
        Inactive:         846400 kB       37196 kB
        Active(anon):     113304 kB        1588 kB
        Inactive(anon):      412 kB           0 kB
        Active(file):     323132 kB        8156 kB
        Inactive(file):   845988 kB       37196 kB
        Unevictable:           0 kB           0 kB
        Mlocked:               0 kB           0 kB
        Dirty:               244 kB           0 kB
        Writeback:             0 kB           0 kB
        FilePages:       1169832 kB       45352 kB
        Mapped:             9088 kB           0 kB
        AnonPages:        113596 kB        1588 kB
        Shmem:               416 kB           0 kB
        KernelStack:        1472 kB           8 kB
        PageTables:         2600 kB           0 kB
        NFS_Unstable:          0 kB           0 kB
        Bounce:                0 kB           0 kB
        WritebackTmp:          0 kB           0 kB
        Slab:            1133616 kB      701972 kB
        SReclaimable:     902552 kB      693048 kB
        SUnreclaim:       231064 kB        8924 kB
        HugePages_Total:     0              0
        HugePages_Free:      0              0
        HugePages_Surp:      0              0

And somehow pageout() is called with a very low scan priority, hence
the vm_dirty_pressure introduced in the patch "mm: lower soft dirty limits on
memory pressure" goes all the way down to 0, which makes balance_dirty_pages()
start aggressive dirty throttling.
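
Roughly speaking (a simplified sketch rather than the exact code in that
patch, assuming the pressure factor is normalized to 1024 as the debugfs
default below suggests), the scaling is:

	/* illustrative only: 1024 keeps the soft limit intact,
	 * 0 collapses it and triggers aggressive throttling */
	soft_thresh = bdi_thresh * vm_dirty_pressure / 1024;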

root@wfg-ne02 ~# cat /debug/vm/dirty_pressure              
0
root@wfg-ne02 ~# echo 1024 > /debug/vm/dirty_pressure

After restoring vm_dirty_pressure, the performance immediately recovers:

# vmmon nr_free_pages nr_anon_pages nr_file_pages nr_dirty nr_writeback nr_slab_reclaimable slabs_scanned

    nr_free_pages    nr_anon_pages    nr_file_pages         nr_dirty     nr_writeback nr_slab_reclaimable    slabs_scanned          
           870915            13165           337210             1602             8394           221271          2910208
           869924            13206           338116             1532             8293           221414          2910208
           868889            13245           338977             1403             7764           221515          2910208
           867892            13359           339669             1327             8071           221579          2910208
--- vm_dirty_pressure restores from here on ---------------------------------------------------------------------------
           866354            13358           341162             2290             8290           221665          2910208
           863627            13419           343259             4014             8332           221833          2910208
           861008            13662           344968             5854             8333           222092          2910208
           858513            13601           347019             7622             8333           222371          2910208
           855272            13693           348987             9449             8333           223301          2910208
           851959            13789           350898            11287             8333           224273          2910208
           848641            13878           352812            13060             8333           225223          2910208
           845398            13967           354822            14854             8333           226193          2910208
           842216            14062           356749            16684             8333           227148          2910208
           838844            14152           358862            18500             8333           228129          2910208
           835447            14245           360678            20313             8333           229084          2910208
           832265            14338           362561            22117             8333           230058          2910208
           829098            14429           364710            23906             8333           231005          2910208
           825609            14520           366530            25726             8333           231971          2910208

# dstat
        ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
        usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
          0   6  82   0   0  12|   0  2240k| 766B 8066B|   0     0 |1435  1649 
          0   4  85   0   0  11|   0  2266k| 262B  436B|   0     0 |1141  1055 
          0   5  83   0   0  12|   0  2196k| 630B 7132B|   0     0 |1144  1053 
          0   6  81   0   0  13|   0  2424k|1134B   20k|   0     0 |1284  1282 
          0   7  81   0   0  12|   0  2152k| 628B 4660B|   0     0 |1634  1944 
          0   4  84   0   0  12|   0  2184k| 192B  580B|   0     0 |1133  1037 
          0   4  84   0   0  12|   0  2440k| 192B  564B|   0     0 |1197  1124 
--- vm_dirty_pressure restores from here on -----------------------------------
          0  51  35   0   0  14| 112k 6718k|  20k   17k|   0     0 |2539  1478 
          1  83   0   0   0  17|   0    13M| 252B  564B|   0     0 |3221  1270 
          0  78   6   0   0  16|   0    15M|1434B   12k|   0     0 |3596  1590 
          0  83   1   0   0  16|   0    13M| 324B 4154B|   0     0 |3318  1374 
          0  80   4   1   0  16|   0    14M|1706B 9824B|   0     0 |3469  1632 
          0  76   5   1   0  18|   0    15M| 636B 4558B|   0     0 |3777  1940 
          0  71   9   1   0  19|   0    17M| 510B 3068B|   0     0 |4018  2277 

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-13  8:26       ` Wu Fengguang
@ 2010-10-13  9:26         ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-10-13  9:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Wed, Oct 13, 2010 at 04:26:12PM +0800, Wu Fengguang wrote:
> On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> > On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > > Wu, what's the state of this series?  It looks like we'll need it
> > > rather sooner than later - try to get at least the preparations in
> > > ASAP would be really helpful.
> > 
> > Not ready in it's current form. This load (creating millions of 1
> > byte files in parallel):
> > 
> > $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > > -d /mnt/scratch/6 -d /mnt/scratch/7
> > 
> > Locks up all the fs_mark processes spinning in traces like the
> > following and no further progress is made when the inode cache
> > fills memory.
> 
> I reproduced the problem on a 6G/8p 2-socket 11-disk box.
> 
> The root cause is, pageout() is somehow called with low scan priority,
> which deserves more investigation.
> 
> The direct cause is, balance_dirty_pages() then keeps nr_dirty too low,
> which can be improved easily by not pushing down the soft dirty limit
> to less than 1-second worth of dirty pages.
> 
> My test box has two nodes, and their memory usage are rather unbalanced:
> (Dave, maybe you have NUMA setup too?)

No, I'm running the test in a single node VM.

FYI, I'm running the test on XFS (16TB 12 disk RAID0 stripe), using
the mount options "inode64,nobarrier,logbsize=262144,delaylog".

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-12 14:17   ` Christoph Hellwig
@ 2010-10-14 13:12     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-10-14 13:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, LKML, Andrew Morton, Theodore Ts'o, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, Li, Shaohua

Hi Christoph,

On Tue, Oct 12, 2010 at 10:17:16PM +0800, Christoph Hellwig wrote:
> Wu, what's the state of this series?  It looks like we'll need it
> rather sooner than later - try to get at least the preparations in
> ASAP would be really helpful.

Sorry, I was doing some audio work over the last month and will be attending
the China Linux Storage and Filesystem workshop and the kernel developers
conference these days. I'll be able to pick up this series on 10.18.

Sorry for the delay!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-10-13  9:26         ` Dave Chinner
@ 2010-11-01  6:24           ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-11-01  6:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Wed, Oct 13, 2010 at 08:26:27PM +1100, Dave Chinner wrote:
> On Wed, Oct 13, 2010 at 04:26:12PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> > > On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > > > Wu, what's the state of this series?  It looks like we'll need it
> > > > rather sooner than later - try to get at least the preparations in
> > > > ASAP would be really helpful.
> > > 
> > > Not ready in it's current form. This load (creating millions of 1
> > > byte files in parallel):
> > > 
> > > $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > > > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > > > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > > > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > > > -d /mnt/scratch/6 -d /mnt/scratch/7
> > > 
> > > Locks up all the fs_mark processes spinning in traces like the
> > > following and no further progress is made when the inode cache
> > > fills memory.
> > 
> > I reproduced the problem on a 6G/8p 2-socket 11-disk box.
> > 
> > The root cause is, pageout() is somehow called with low scan priority,
> > which deserves more investigation.
> > 
> > The direct cause is, balance_dirty_pages() then keeps nr_dirty too low,
> > which can be improved easily by not pushing down the soft dirty limit
> > to less than 1-second worth of dirty pages.
> > 
> > My test box has two nodes, and their memory usage are rather unbalanced:
> > (Dave, maybe you have NUMA setup too?)
> 
> No, I'm running the test in a single node VM.
> 
> FYI, I'm running the test on XFS (16TB 12 disk RAID0 stripe), using
> the mount options "inode64,nobarrier,logbsize=262144,delaylog".

Any update on the current status of this patchset?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-11-01  6:24           ` Dave Chinner
@ 2010-11-04  3:41             ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-11-04  3:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

Hi Dave,

On Mon, Nov 01, 2010 at 02:24:46PM +0800, Dave Chinner wrote:
> On Wed, Oct 13, 2010 at 08:26:27PM +1100, Dave Chinner wrote:
> > On Wed, Oct 13, 2010 at 04:26:12PM +0800, Wu Fengguang wrote:
> > > On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> > > > On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > > > > Wu, what's the state of this series?  It looks like we'll need it
> > > > > rather sooner than later - try to get at least the preparations in
> > > > > ASAP would be really helpful.
> > > > 
> > > > Not ready in it's current form. This load (creating millions of 1
> > > > byte files in parallel):
> > > > 
> > > > $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > > > > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > > > > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > > > > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > > > > -d /mnt/scratch/6 -d /mnt/scratch/7
> > > > 
> > > > Locks up all the fs_mark processes spinning in traces like the
> > > > following and no further progress is made when the inode cache
> > > > fills memory.
> > > 
> > > I reproduced the problem on a 6G/8p 2-socket 11-disk box.
> > > 
> > > The root cause is, pageout() is somehow called with low scan priority,
> > > which deserves more investigation.
> > > 
> > > The direct cause is, balance_dirty_pages() then keeps nr_dirty too low,
> > > which can be improved easily by not pushing down the soft dirty limit
> > > to less than 1-second worth of dirty pages.
> > > 
> > > My test box has two nodes, and their memory usage are rather unbalanced:
> > > (Dave, maybe you have NUMA setup too?)
> > 
> > No, I'm running the test in a single node VM.
> > 
> > FYI, I'm running the test on XFS (16TB 12 disk RAID0 stripe), using
> > the mount options "inode64,nobarrier,logbsize=262144,delaylog".
> 
> Any update on the current status of this patchset?

The last 3 patches, which dynamically lower the 20% dirty limit, seem
to hurt writeback throughput when the limit goes too small. That's not
surprising. I tried moderately increasing the low bound of the dynamic
dirty limit, but tests show that it's still not enough. A few days ago
I came up with another low-bound scheme; however, the test box has
been running LKP (and other) benchmarks for the new -rc1 release.

Anyway, I see some tricky points in deciding the low bound for the
dynamic dirty limit. It seems reasonable to bypass this feature for
now and to test/submit the other important parts first.
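
For illustration, a minimal userspace sketch of the kind of low bound
being discussed: clamp the dynamically lowered limit so it never drops
below roughly one second's worth of dirty pages at the estimated write
bandwidth. The 100 MB/s figure, the page counts and the variable names
below are assumptions for the example only, not values or code from
this patch series.

	#include <stdio.h>

	int main(void)
	{
		/* Assumed numbers, purely illustrative. */
		unsigned long bw_bytes_per_sec = 100UL << 20;	/* estimated write bandwidth */
		unsigned long page_size = 4096;
		unsigned long dyn_limit = 10000;	/* dynamically lowered limit, in pages */

		/* Low bound: one second worth of dirty pages at that bandwidth. */
		unsigned long floor = bw_bytes_per_sec / page_size;	/* 25600 pages, ~100 MB */
		unsigned long limit = dyn_limit > floor ? dyn_limit : floor;

		printf("floor = %lu pages, effective limit = %lu pages\n",
		       floor, limit);
		return 0;
	}

With these example numbers the 10000-page dynamic limit would be
raised to the 25600-page floor, so the soft limit is never pushed
below what the device can write back in about one second.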

I'm feeling relatively good about the first 14 patches, which do
IO-less balance_dirty_pages() and a larger writeback chunk size. I'll
repost them separately as v2 after returning to Shanghai.

Some days ago I prepared some slides with figures on the old and new
dirty throttling schemes. Hope they help.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-11-04  3:41             ` Wu Fengguang
@ 2010-11-04 12:48               ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-11-04 12:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Thu, Nov 04, 2010 at 11:41:19AM +0800, Wu Fengguang wrote:
> Hi Dave,
> 
> On Mon, Nov 01, 2010 at 02:24:46PM +0800, Dave Chinner wrote:
> > On Wed, Oct 13, 2010 at 08:26:27PM +1100, Dave Chinner wrote:
> > > On Wed, Oct 13, 2010 at 04:26:12PM +0800, Wu Fengguang wrote:
> > > > On Wed, Oct 13, 2010 at 11:07:33AM +0800, Dave Chinner wrote:
> > > > > On Tue, Oct 12, 2010 at 10:17:16AM -0400, Christoph Hellwig wrote:
> > > > > > Wu, what's the state of this series?  It looks like we'll need it
> > > > > > rather sooner than later - try to get at least the preparations in
> > > > > > ASAP would be really helpful.
> > > > > 
> > > > > Not ready in it's current form. This load (creating millions of 1
> > > > > byte files in parallel):
> > > > > 
> > > > > $ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > > > > > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > > > > > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > > > > > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > > > > > -d /mnt/scratch/6 -d /mnt/scratch/7
> > > > > 
> > > > > Locks up all the fs_mark processes spinning in traces like the
> > > > > following and no further progress is made when the inode cache
> > > > > fills memory.
> > > > 
> > > > I reproduced the problem on a 6G/8p 2-socket 11-disk box.
> > > > 
> > > > The root cause is, pageout() is somehow called with low scan priority,
> > > > which deserves more investigation.
> > > > 
> > > > The direct cause is, balance_dirty_pages() then keeps nr_dirty too low,
> > > > which can be improved easily by not pushing down the soft dirty limit
> > > > to less than 1-second worth of dirty pages.
> > > > 
> > > > My test box has two nodes, and their memory usage are rather unbalanced:
> > > > (Dave, maybe you have NUMA setup too?)
> > > 
> > > No, I'm running the test in a single node VM.
> > > 
> > > FYI, I'm running the test on XFS (16TB 12 disk RAID0 stripe), using
> > > the mount options "inode64,nobarrier,logbsize=262144,delaylog".
> > 
> > Any update on the current status of this patchset?
> 
> The last 3 patches to dynamically lower the 20% dirty limit seem
> to hurt writeback throughput when it goes too small. That's not
> surprising. I tried moderately increase the low bound of dynamic
> dirty limit but tests show that it's still not enough. Days ago I
> came up with another low bound scheme, however the test box has
> been running LKP (and other) benchmarks for the new -rc1 release..
> 
> Anyway I see some tricky points in deciding the low bound for dynamic
> dirty limit. It seems reasonable to bypass this feature for now, and
> to test/submit the other important parts first.
> 
> I'm feeling relatively good about the first 14 patches to do IO-less
> balance_dirty_pages() and larger writeback chunk size. I'll repost
> them separately as v2 after returning to Shanghai.

As I've pointed out already, increasing the writeback chunk size is
not a good idea, so I'd suggest separating it from the IO-less
balance_dirty_pages() series.

> Some days ago I prepared some slides which has some figures on the old
> and new dirty throttling schemes. Hope it helps.
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf

Pretty colours, but they don't really add much to what I already
understood from your series description. I guess the slides lose
something without someone talking through them.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-11-04  3:41             ` Wu Fengguang
@ 2010-11-04 13:12               ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-11-04 13:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Christoph Hellwig,
	Li, Shaohua

On Thu, Nov 04, 2010 at 11:41:19AM +0800, Wu Fengguang wrote:
> I'm feeling relatively good about the first 14 patches to do IO-less
> balance_dirty_pages() and larger writeback chunk size. I'll repost
> them separately as v2 after returning to Shanghai.

Going for patchsets that are as small as possible is a pretty good
idea.  Just getting the IO-less balance_dirty_pages() in on its own
would be a really good start, as that's one of the really critical
pieces of infrastructure that a lot of people are waiting for.
Getting it into linux-mm/linux-next ASAP so that it gets a lot of
testing would be highly useful.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-11-04 13:12               ` Christoph Hellwig
@ 2010-11-05 14:56                 ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-11-05 14:56 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton
  Cc: Dave Chinner, Christoph Hellwig, linux-mm, LKML, Andrew Morton,
	Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Li, Shaohua,
	Greg Thelen

On Thu, Nov 04, 2010 at 09:12:28PM +0800, Christoph Hellwig wrote:
> On Thu, Nov 04, 2010 at 11:41:19AM +0800, Wu Fengguang wrote:
> > I'm feeling relatively good about the first 14 patches to do IO-less
> > balance_dirty_pages() and larger writeback chunk size. I'll repost
> > them separately as v2 after returning to Shanghai.
> 
> Going for as small as possible patchsets is a pretty good idea.  Just
> getting the I/O less balance_dirty_pages on it's own would be a really
> good start, as that's one of the really criticial pieces of
> infrastructure that a lot of people are waiting for.  Getting it into
> linux-mm/linux-next ASAP so that it gets a lot of testing would be
> highly useful.

OK, I'll do a smaller IO-less balance_dirty_pages() patchset (it's
good to know which part is the most relevant one, which is not always
obvious from my limited field experience), which will further reduce
the risk of unexpected regressions.

Currently the -mm tree includes Greg's patchset "memcg: per cgroup
dirty page accounting". I'm going to rebase my patches onto it;
however, I'd first like to make sure whether Greg's patches are going
to be pushed in the next merge window. I personally have no problem
with that.  Andrew?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits
  2010-11-05 14:56                 ` Wu Fengguang
@ 2010-11-06 10:42                   ` Dave Chinner
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Chinner @ 2010-11-06 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Andrew Morton, Christoph Hellwig, linux-mm,
	LKML, Theodore Ts'o, Jan Kara, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Chris Mason, Li, Shaohua,
	Greg Thelen

On Fri, Nov 05, 2010 at 10:56:39PM +0800, Wu Fengguang wrote:
> On Thu, Nov 04, 2010 at 09:12:28PM +0800, Christoph Hellwig wrote:
> > On Thu, Nov 04, 2010 at 11:41:19AM +0800, Wu Fengguang wrote:
> > > I'm feeling relatively good about the first 14 patches to do IO-less
> > > balance_dirty_pages() and larger writeback chunk size. I'll repost
> > > them separately as v2 after returning to Shanghai.
> > 
> > Going for as small as possible patchsets is a pretty good idea.  Just
> > getting the I/O less balance_dirty_pages on it's own would be a really
> > good start, as that's one of the really criticial pieces of
> > infrastructure that a lot of people are waiting for.  Getting it into
> > linux-mm/linux-next ASAP so that it gets a lot of testing would be
> > highly useful.
> 
> OK, I'll do a smaller IO-less balance_dirty_pages() patchset (it's
> good to know which part is the most relevant one, which is not always
> obvious by my limited field experiences), which will further reduce
> the possible risk of unexpected regressions.

Which is good given the recent history of writeback mods. :/

> Currently the -mm tree includes Greg's patchset "memcg: per cgroup
> dirty page accounting". I'm going to rebase my patches onto it,
> however I'd like to first make sure if Greg's patches are going to be
> pushed in the next merge window. I personally have no problem with
> that.  Andrew?

Well, I'd prefer that you provide a git tree that I can just pull
into my current working branch to test. Having to pull in a thousand
other changes to test your writeback changes makes it much harder
for me as I'd have to establish a new stable performance/behavioural
baseline before starting to analyse your series. If it's based on
mainline then I've already got that baseline....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2010-11-06 10:45 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-12 15:49 [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits Wu Fengguang
2010-09-12 15:49 ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 01/17] writeback: remove the internal 5% low bound on dirty_ratio Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-13  9:23   ` Johannes Weiner
2010-09-13  9:23     ` Johannes Weiner
2010-09-13  9:51   ` Mel Gorman
2010-09-13  9:51     ` Mel Gorman
2010-09-13  9:57     ` Wu Fengguang
2010-09-13  9:57       ` Wu Fengguang
2010-09-13 10:10       ` Mel Gorman
2010-09-13 10:10         ` Mel Gorman
2010-09-12 15:49 ` [PATCH 02/17] writeback: IO-less balance_dirty_pages() Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-13  8:45   ` Dave Chinner
2010-09-13  8:45     ` Dave Chinner
2010-09-13 11:38     ` Wu Fengguang
2010-09-13 11:38       ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 03/17] writeback: per-task rate limit to balance_dirty_pages() Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 04/17] writeback: quit throttling when bdi dirty/writeback pages go down Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 05/17] writeback: quit throttling when signal pending Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 20:46   ` Neil Brown
2010-09-12 20:46     ` Neil Brown
2010-09-13  1:55     ` Wu Fengguang
2010-09-13  1:55       ` Wu Fengguang
2010-09-13  3:21       ` Neil Brown
2010-09-13  3:21         ` Neil Brown
2010-09-13  3:48         ` Wu Fengguang
2010-09-13  3:48           ` Wu Fengguang
2010-09-14  8:23           ` KOSAKI Motohiro
2010-09-14  8:23             ` KOSAKI Motohiro
2010-09-14  8:33             ` Wu Fengguang
2010-09-14  8:33               ` Wu Fengguang
2010-09-14  8:44               ` KOSAKI Motohiro
2010-09-14  8:44                 ` KOSAKI Motohiro
2010-09-14  9:17                 ` Wu Fengguang
2010-09-14  9:17                   ` Wu Fengguang
2010-09-14  9:25                   ` KOSAKI Motohiro
2010-09-14  9:25                     ` KOSAKI Motohiro
2010-09-12 15:49 ` [PATCH 06/17] writeback: move task dirty fraction to balance_dirty_pages() Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 07/17] writeback: add trace event for balance_dirty_pages() Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 08/17] writeback: account per-bdi accumulated written pages Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:59   ` Wu Fengguang
2010-09-12 15:59     ` Wu Fengguang
2010-09-14  8:32   ` KOSAKI Motohiro
2010-09-14  8:32     ` KOSAKI Motohiro
2010-09-12 15:49 ` [PATCH 09/17] writeback: bdi write bandwidth estimation Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 10/17] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 11/17] writeback: make nr_to_write a per-file limit Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 12/17] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 13/17] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 16:15   ` Wu Fengguang
2010-09-12 16:15     ` Wu Fengguang
2010-09-12 15:49 ` [PATCH 14/17] vmscan: add scan_control.priority Wu Fengguang
2010-09-12 15:49   ` Wu Fengguang
2010-09-12 15:50 ` [PATCH 15/17] mm: lower soft dirty limits on memory pressure Wu Fengguang
2010-09-12 15:50   ` Wu Fengguang
2010-09-13  9:40   ` Wu Fengguang
2010-09-13  9:40     ` Wu Fengguang
2010-09-12 15:50 ` [PATCH 16/17] mm: create /vm/dirty_pressure in debugfs Wu Fengguang
2010-09-12 15:50   ` Wu Fengguang
2010-09-12 15:50 ` [PATCH 17/17] writeback: consolidate balance_dirty_pages() variable names Wu Fengguang
2010-09-12 15:50   ` Wu Fengguang
2010-10-12 14:17 ` [PATCH 00/17] [RFC] soft and dynamic dirty throttling limits Christoph Hellwig
2010-10-12 14:17   ` Christoph Hellwig
2010-10-13  3:07   ` Dave Chinner
2010-10-13  3:07     ` Dave Chinner
2010-10-13  3:23     ` Wu Fengguang
2010-10-13  3:23       ` Wu Fengguang
2010-10-13  8:26     ` Wu Fengguang
2010-10-13  8:26       ` Wu Fengguang
2010-10-13  9:26       ` Dave Chinner
2010-10-13  9:26         ` Dave Chinner
2010-11-01  6:24         ` Dave Chinner
2010-11-01  6:24           ` Dave Chinner
2010-11-04  3:41           ` Wu Fengguang
2010-11-04  3:41             ` Wu Fengguang
2010-11-04 12:48             ` Dave Chinner
2010-11-04 12:48               ` Dave Chinner
2010-11-04 13:12             ` Christoph Hellwig
2010-11-04 13:12               ` Christoph Hellwig
2010-11-05 14:56               ` Wu Fengguang
2010-11-05 14:56                 ` Wu Fengguang
2010-11-06 10:42                 ` Dave Chinner
2010-11-06 10:42                   ` Dave Chinner
2010-10-14 13:12   ` Wu Fengguang
2010-10-14 13:12     ` Wu Fengguang
