linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/11] IO-less dirty throttling v12
@ 2011-10-03 13:42 Wu Fengguang
  2011-10-03 13:42 ` [PATCH 01/11] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi,

This is the minimal set of IO-less balance_dirty_pages() changes expected to
be regression free (well, except for NFS).

        git://github.com/fengguang/linux.git dirty-throttling-v12

Test results will be posted in a separate email.

Changes since v11:

- improve bdi reserve area parameters (based on test results)
- drop bdi underrun flag
- drop aux bdi control line
- make bdi->dirty_ratelimit more stable

Changes since v10:

- complete the renames
- add protections for IO queue underrun
  - pause time reduction
  - bdi reserve area
  - bdi underrun flag
- more accurate task dirty accounting for
  - sub-page writes
  - FS re-dirties
  - short lived tasks

Changes since v9:

- a lot of renames and comment/changelog rework, again
- separate out the dirty_ratelimit update policy (as patch 04)
- add think time compensation
- add 3 trace events

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing update step size on small errors
- limit per-CPU dirtied pages to avoid dirty pages running away on 1k+ tasks (Peter)

Thanks a lot to Peter, Vivek, Andrea and Jan for the careful reviews!

shortlog:

Wu Fengguang (11):
      writeback: account per-bdi accumulated dirtied pages
      writeback: dirty position control
      writeback: add bg_threshold parameter to __bdi_update_bandwidth()
      writeback: dirty rate control
      writeback: stabilize bdi->dirty_ratelimit
      writeback: per task dirty rate limit
      writeback: IO-less balance_dirty_pages()
      writeback: limit max dirty pause time
      writeback: control dirty pause time
      writeback: dirty position control - bdi reserve area
      writeback: per-bdi background threshold

diffstat:

 fs/fs-writeback.c                |   19 +-
 include/linux/backing-dev.h      |   11 +
 include/linux/sched.h            |    7 +
 include/linux/writeback.h        |    1 +
 include/trace/events/writeback.h |   24 --
 kernel/fork.c                    |    3 +
 mm/backing-dev.c                 |    4 +
 mm/page-writeback.c              |  678 +++++++++++++++++++++++++++++---------
 8 files changed, 566 insertions(+), 181 deletions(-)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/11] writeback: account per-bdi accumulated dirtied pages
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 02/11] writeback: dirty position control Wu Fengguang
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Michael Rubin, Wu Fengguang,
	Andrew Morton, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2019 bytes --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-10-03 21:05:32.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-10-03 21:05:33.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-10-03 21:05:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:33.000000000 +0800
@@ -1322,6 +1322,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-10-03 21:05:32.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-10-03 21:05:33.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,




* [PATCH 02/11] writeback: dirty position control
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
  2011-10-03 13:42 ` [PATCH 01/11] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 03/11] writeback: add bg_threshold parameter to __bdi_update_bandwidth() Wu Fengguang
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Wu Fengguang, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 11896 bytes --]

bdi_position_ratio() provides a scale factor for bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

The old scheme was:
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

The new scheme is:

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence the task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within one second's worth of data, the bdi control line's slope is
chosen to be a linear function of the bdi write bandwidth, so that it can
adapt well to both slow and fast storage devices.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

Targeting a 12.5% fluctuation range in pos_ratio while the dirty pages
fluctuate within the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
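For illustration, the slope derivation above can be checked numerically with
a few lines of Python (all numbers are made up; write_bw is a power of two
only so the float math is exact, and the kernel itself uses fixed-point
integer arithmetic rather than floats):

```python
# Illustrative sketch only, not the kernel code: the per-bdi linear control
# line with slope k = -1 / (8 * write_bw), checked against the 12.5%
# fluctuation target and the x_intercept derived above.

def bdi_pos_ratio(dirty, bdi_setpoint, write_bw):
    k = -1.0 / (8 * write_bw)               # the negative slope
    return 1.0 + k * (dirty - bdi_setpoint)

write_bw = 16384     # pages per interval (made-up; 2^14 keeps floats exact)
setpoint = 100000    # pages (made-up)

# f(bdi_setpoint) = 1.0: the balance point.
assert bdi_pos_ratio(setpoint, setpoint, write_bw) == 1.0

# Dirty pages sweeping [setpoint - write_bw/2, setpoint + write_bw/2]
# move pos_ratio over a total range of exactly 1/8 = 12.5%.
hi = bdi_pos_ratio(setpoint - write_bw / 2, setpoint, write_bw)
lo = bdi_pos_ratio(setpoint + write_bw / 2, setpoint, write_bw)
assert hi - lo == 0.125

# pos_ratio reaches 0 exactly at x_intercept = bdi_setpoint + 8 * write_bw.
assert bdi_pos_ratio(setpoint + 8 * write_bw, setpoint, write_bw) == 0.0
```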

The global/bdi slopes nicely complement each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the writeout bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between disks.  In that
case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

Given equations

        span = x_intercept - bdi_setpoint
        k = df/dx = - 1 / span

and the extremum values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / span = - 1.0

That means that when bdi_dirty deviates upward by a full bdi_thresh,
pos_ratio and hence the task ratelimit will fluctuate by -100%.
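As a sketch of that extremum (made-up numbers; not the kernel code), the
weighted span formula quoted in the bdi_position_ratio() comment in the
patch below can be evaluated directly:

```python
# Illustrative check of the weighted span formula from the patch comment:
#
#          bdi_thresh                    thresh - bdi_thresh
#   span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
#            thresh                            thresh

def bdi_span(thresh, bdi_thresh, write_bw):
    return (bdi_thresh * 8 * write_bw
            + (thresh - bdi_thresh) * bdi_thresh) / thresh

# Single-bdi case (bdi_thresh ~= thresh): span degenerates to 8 * write_bw.
assert bdi_span(1000, 1000, 50) == 8 * 50

# JBOD extremum (bdi_thresh << thresh): span approaches bdi_thresh, so a
# deviation of dx = bdi_thresh yields df = -dx / span ~= -1.0, i.e. the
# full -100% pos_ratio swing stated above.
span = bdi_span(10**6, 1000, 50)
df = -1000 / span
assert 0.95 < -df < 1.05
```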

peter: use 3rd order polynomial for the global control line
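That 3rd order polynomial (see the comment block in bdi_position_ratio() in
the patch below) can be sanity-checked in fixed point with made-up page
counts; note Python's integer operators floor where C's div_s64 truncates,
which is close enough for this check:

```python
# Illustrative fixed-point sketch of the cubic global control line:
#   pos_ratio = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
# computed with a 10-bit shift like the kernel's RATELIMIT_CALC_SHIFT.

SHIFT = 10  # RATELIMIT_CALC_SHIFT

def global_pos_ratio(dirty, freerun, limit):
    setpoint = (freerun + limit) // 2
    x = ((setpoint - dirty) << SHIFT) // (limit - setpoint + 1)
    pos = x
    pos = pos * x >> SHIFT      # x^2
    pos = pos * x >> SHIFT      # x^3
    return pos + (1 << SHIFT)   # 1.0 + x^3, in units of 1/1024

freerun, limit = 1000, 3000                     # made-up page counts
setpoint = (freerun + limit) // 2               # 2000

assert global_pos_ratio(setpoint, freerun, limit) == 1 << SHIFT  # f(setpoint) = 1.0
assert global_pos_ratio(freerun, freerun, limit) >= 2040         # f(freerun) ~= 2.0
assert 0 <= global_pos_ratio(limit, freerun, limit) <= 8         # f(limit)   ~= 0
```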

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  191 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 190 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 11:28:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:10.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,184 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            *
+ *     |              *
+ *     |                *
+ *     |                  *
+ *     |                    * |<=========== span ============>|
+ * 1.0 .......................*
+ *     |                      . *
+ *     |                      .   *
+ *     |                      .     *
+ *     |                      .       *
+ *     |                      .         *
+ *     |                      .           *
+ *     |                      .             *
+ *     |                      .               *
+ *     |                      .                 *
+ *     |                      .                   *
+ *     |                      .                     *
+ * 1/4 ...............................................* * * * * * * * * * * *
+ *     |                      .                         .
+ *     |                      .                           .
+ *     |                      .                             .
+ *   0 +----------------------.-------------------------------.------------->
+ *                bdi_setpoint^                    x_intercept^
+ *
+ * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can
+ * be smoothly throttled down to normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following mechanism.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(8*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                    thresh - bdi_thresh
+	 * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
+	 *          thresh                            thresh
+	 */
+	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
+	x_intercept = bdi_setpoint + span;
+
+	if (bdi_dirty < x_intercept - span / 4) {
+		pos_ratio *= x_intercept - bdi_dirty;
+		do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+	} else
+		pos_ratio /= 4;
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -655,6 +841,7 @@ static void balance_dirty_pages(struct a
 	unsigned long nr_reclaimable, bdi_nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
+	unsigned long freerun;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -679,7 +866,9 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		freerun = dirty_freerun_ceiling(dirty_thresh,
+						background_thresh);
+		if (nr_dirty <= freerun)
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);




* [PATCH 03/11] writeback: add bg_threshold parameter to __bdi_update_bandwidth()
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
  2011-10-03 13:42 ` [PATCH 01/11] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
  2011-10-03 13:42 ` [PATCH 02/11] writeback: dirty position control Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 04/11] writeback: dirty rate control Wu Fengguang
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-interface-add-bg_thresh --]
[-- Type: text/plain, Size: 2669 bytes --]

No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 +-
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |   11 +++++++----
 3 files changed, 9 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 20:44:55.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 20:46:17.000000000 +0800
@@ -779,6 +779,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -815,6 +816,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -823,8 +825,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -912,8 +914,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-10-03 20:44:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-03 20:45:27.000000000 +0800
@@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-10-03 20:44:51.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-10-03 20:45:27.000000000 +0800
@@ -143,6 +143,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,




* [PATCH 04/11] writeback: dirty rate control
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 03/11] writeback: add bg_threshold parameter to __bdi_update_bandwidth() Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 05/11] writeback: stabilize bdi->dirty_ratelimit Wu Fengguang
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 10000 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

Every 200ms, update bdi->dirty_ratelimit
========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit, in
order to avoid memory deadlocks. Furthermore, we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW, we want to throttle tasks such that the dirty rate matches the
writeout bandwidth; this yields a stable amount of dirty pages:

        dirty_rate == write_bw                                          (1)

The fairness requirement gives us:

        task_ratelimit = balanced_dirty_ratelimit
                       == write_bw / N                                  (2)

where N is the number of dd tasks.  We don't know N beforehand, but can
still estimate balanced_dirty_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                               (3)
                         (any non-zero initial value is OK)

After 200ms, we have measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For aggressive dd dirtiers, the following equality holds:

        dirty_rate == N * task_rate
                   == N * task_ratelimit_0                              (4)
Or
        task_ratelimit_0 == dirty_rate / N                              (5)

Now we conclude that the balanced task ratelimit can be estimated by

                                                      write_bw
        balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                      dirty_rate

Because with (4) and (5) we can get the desired equality (1):

                                                       write_bw
        balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                       dirty_rate
                                 == write_bw / N

Then, using the balanced task ratelimit, we can compute task pause times as:

        task_pause = task->nr_dirtied / task_ratelimit
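The steps (3)-(6) above can be walked through numerically (made-up numbers;
a standalone sketch, not the kernel code): any non-zero initial guess
recovers write_bw / N after one measurement interval, because N cancels out.

```python
# Illustrative walk-through of the balanced ratelimit estimation.

write_bw = 40000          # pages written per interval (example value)
N = 8                     # number of dd tasks -- unknown to the kernel
task_ratelimit_0 = 123    # step (3): any non-zero initial value is OK

# Step (4): with every dd throttled at task_ratelimit_0, the measured
# aggregate dirty rate is N times that.
dirty_rate = N * task_ratelimit_0

# Step (6): the balanced estimate, independent of the initial 123.
balanced_dirty_ratelimit = task_ratelimit_0 * write_bw / dirty_rate
assert balanced_dirty_ratelimit == write_bw / N      # == 5000

# A task that has dirtied 25 pages then pauses for:
task_pause = 25 / balanced_dirty_ratelimit           # 0.005 interval
assert task_pause == 25 / (write_bw / N)
```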

task_ratelimit with position control
------------------------------------

However, while the above gives us a means of matching the dirty rate to
the writeout bandwidth, at best it provides a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough for good performance but does not
exceed the specified limit, we need another control.

The dirty position control works by extending (2) to

        task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_dirty_ratelimit, so that dirty
pages are created more slowly than they are cleaned, and thus DROP back
to the setpoint (and the reverse below the setpoint).

Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
have remained CONSTANT over the past 200ms, we get

        task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)

Putting (8) into (6), we get the formula used in
bdi_update_dirty_ratelimit():

                                                write_bw
        balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                dirty_rate
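The need to keep pos_ratio in formula (9), rather than using pure rate
feedback, can be seen in the "stuck state" discussed in the patch's code
comments. A small sketch with made-up numbers (pos_ratio frozen at 0.5 for
simplicity; in reality it moves with the dirty count):

```python
# Illustrative contrast between pure rate feedback and formula (9).

write_bw, N = 40000, 8
pos_ratio = 0.5
rate = 2 * write_bw / N            # 10000: twice the balanced 5000

# Each dd dirties at task_ratelimit = rate * pos_ratio, so the measured
# aggregate dirty rate equals write_bw and the system *looks* balanced:
dirty_rate = N * rate * pos_ratio
assert dirty_rate == write_bw

# Pure rate feedback, rate_(i+1) = rate_(i) * (write_bw / dirty_rate):
# no movement, the mismatched state is self-sustaining.
rate_pure = rate * (write_bw / dirty_rate)
assert rate_pure == rate

# Formula (9): multiplying in pos_ratio corrects the rate in one step.
rate_9 = rate * pos_ratio * (write_bw / dirty_rate)
assert rate_9 == write_bw / N
```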

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   83 +++++++++++++++++++++++++++++++++-
 3 files changed, 89 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-09-29 21:33:39.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-09-29 21:33:59.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-09-29 21:33:39.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-09-29 21:33:59.000000000 +0800
@@ -686,6 +686,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-09-29 21:33:59.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-29 21:34:04.000000000 +0800
@@ -777,6 +777,79 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long task_ratelimit;
+	unsigned long balanced_dirty_ratelimit;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeout rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
+	 */
+	task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+	task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */
+
+	/*
+	 * A linear estimation of the "balanced" throttle rate. The theory is,
+	 * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
+	 * dirty_rate will be measured to be (N * task_ratelimit). So the below
+	 * formula will yield the balanced rate limit (write_bw / N).
+	 *
+	 * Note that the expanded form is not a pure rate feedback:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)		     (1)
+	 * but also takes pos_ratio into account:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
+	 *
+	 * (1) is not realistic because pos_ratio also takes part in balancing
+	 * the dirty rate.  Consider the state
+	 *	pos_ratio = 0.5						     (3)
+	 *	rate = 2 * (write_bw / N)				     (4)
+	 * If (1) is used, it will stuck in that state! Because each dd will
+	 * be throttled at
+	 *	task_ratelimit = pos_ratio * rate = (write_bw / N)	     (5)
+	 * yielding
+	 *	dirty_rate = N * task_ratelimit = write_bw		     (6)
+	 * put (6) into (1) we get
+	 *	rate_(i+1) = rate_(i)					     (7)
+	 *
+	 * So we end up using (2) to always keep
+	 *	rate_(i+1) ~= (write_bw / N)				     (8)
+	 * regardless of the value of pos_ratio. As long as (8) is satisfied,
+	 * pos_ratio is able to drive itself to 1.0, which is not only where
+	 * the dirty count meet the setpoint, but also where the slope of
+	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
+	 */
+	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+					   dirty_rate | 1);
+
+	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -787,6 +860,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -795,6 +869,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -804,12 +879,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 05/11] writeback: stabilize bdi->dirty_ratelimit
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 04/11] writeback: dirty rate control Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 06/11] writeback: per task dirty rate limit Wu Fengguang
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit-stablize --]
[-- Type: text/plain, Size: 7761 bytes --]

There are some imperfections in balanced_dirty_ratelimit.

1) large fluctuations

The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged over the past 200ms (very short compared to the 3s estimation
period for write_bw), which makes for a rather dispersed distribution of
balanced_dirty_ratelimit.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that such averaging would introduce very
undesirable time lags, I gave it up entirely. (btw, the 3s write_bw
averaging time lag is much more acceptable because its impact is one-way
and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match can become unbalanced, which may lead to large systematic
errors in balanced_dirty_ratelimit. Truncates, due to their possibly
bumpy nature, can hardly be compensated for smoothly. So let's face it:
when some over-estimated balanced_dirty_ratelimit pushes dirty_ratelimit
high, dirty pages will go above the setpoint, and task_ratelimit will
in turn become lower than dirty_ratelimit.  So if we consider both
balanced_dirty_ratelimit and task_ratelimit, and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematic
errors in balanced_dirty_ratelimit won't be able to push
dirty_ratelimit far away.

The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun; however, that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of the task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit)
while dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point in bringing up dirty_ratelimit in a hurry only to hurt
both of the above goals.

So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:

1) avoid changing the dirty rate when it goes against the position control
   target (such an adjustment would slow down the progress of dirty pages
   going back to the setpoint).

2) limit the step size. task_ratelimit changes values step by step,
   leaving a consistent trace compared to the randomly jumping
   balanced_dirty_ratelimit. task_ratelimit also has the nice property
   of small errors in the stable state and typically larger errors when
   there are big errors in rate.  So it's a pretty good limiting factor
   for the step size of dirty_ratelimit.

Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    3 +
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   71 +++++++++++++++++++++++++++++++++-
 3 files changed, 74 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-02 10:28:27.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 11:18:20.000000000 +0800
@@ -792,12 +792,17 @@ static void bdi_update_dirty_ratelimit(s
 				       unsigned long dirtied,
 				       unsigned long elapsed)
 {
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long setpoint = (freerun + limit) / 2;
 	unsigned long write_bw = bdi->avg_write_bandwidth;
 	unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
 	unsigned long dirty_rate;
 	unsigned long task_ratelimit;
 	unsigned long balanced_dirty_ratelimit;
 	unsigned long pos_ratio;
+	unsigned long step;
+	unsigned long x;
 
 	/*
 	 * The dirty rate will match the writeout rate in long term, except
@@ -847,7 +852,71 @@ static void bdi_update_dirty_ratelimit(s
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
 
-	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+	/*
+	 * We could safely do this and return immediately:
+	 *
+	 *	bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+	 *
+	 * However to get a more stable dirty_ratelimit, the below elaborated
+	 * code makes use of task_ratelimit to filter out singular points and
+	 * limit the step size.
+	 *
+	 * The below code essentially only uses the relative value of
+	 *
+	 *	task_ratelimit - dirty_ratelimit
+	 *	= (pos_ratio - 1) * dirty_ratelimit
+	 *
+	 * which reflects the direction and size of dirty position error.
+	 */
+
+	/*
+	 * dirty_ratelimit will follow balanced_dirty_ratelimit iff
+	 * task_ratelimit is on the same side of dirty_ratelimit, too.
+	 * For example, when
+	 * - dirty_ratelimit > balanced_dirty_ratelimit
+	 * - dirty_ratelimit > task_ratelimit (dirty pages are above setpoint)
+	 * lowering dirty_ratelimit will help meet both the position and rate
+	 * control targets. Otherwise, don't update dirty_ratelimit if it will
+	 * only help meet the rate target. After all, what users ultimately
+	 * feel and care about are a stable dirty rate and a small position error.
+	 *
+	 * |task_ratelimit - dirty_ratelimit| is used to limit the step size
+	 * and filter out the singular points of balanced_dirty_ratelimit, which
+	 * keeps jumping around randomly and can even leap far away at times
+	 * due to the small 200ms estimation period of dirty_rate (we want to
+	 * keep that period small to reduce time lags).
+	 */
+	step = 0;
+	if (dirty < setpoint) {
+		x = min(bdi->balanced_dirty_ratelimit,
+			 min(balanced_dirty_ratelimit, task_ratelimit));
+		if (dirty_ratelimit < x)
+			step = x - dirty_ratelimit;
+	} else {
+		x = max(bdi->balanced_dirty_ratelimit,
+			 max(balanced_dirty_ratelimit, task_ratelimit));
+		if (dirty_ratelimit > x)
+			step = dirty_ratelimit - x;
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	step >>= dirty_ratelimit / (2 * step + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	step = (step + 7) / 8;
+
+	if (dirty_ratelimit < balanced_dirty_ratelimit)
+		dirty_ratelimit += step;
+	else
+		dirty_ratelimit -= step;
+
+	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
+	bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
--- linux-next.orig/include/linux/backing-dev.h	2011-10-03 11:17:53.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-10-03 11:18:20.000000000 +0800
@@ -83,8 +83,11 @@ struct backing_dev_info {
 	/*
 	 * The base dirty throttle rate, re-calculated on every 200ms.
 	 * All the bdi tasks' dirty rate will be curbed under it.
+	 * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
+	 * in small steps and is much smoother/more stable than the latter.
 	 */
 	unsigned long dirty_ratelimit;
+	unsigned long balanced_dirty_ratelimit;
 
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
--- linux-next.orig/mm/backing-dev.c	2011-10-03 11:18:51.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-10-03 11:20:16.000000000 +0800
@@ -686,6 +686,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->balanced_dirty_ratelimit = INIT_BW;
 	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;




* [PATCH 06/11] writeback: per task dirty rate limit
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 05/11] writeback: stabilize bdi->dirty_ratelimit Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 07/11] writeback: IO-less balance_dirty_pages() Wu Fengguang
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 8284 bytes --]

Add two fields to task_struct.

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.

The main problem with per-task nr_dirtied is: if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so the dirty threshold will be exceeded long
before each task reaches its nr_dirtied_pause and hence calls
balance_dirty_pages().

The solution is to watch the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force a call to balance_dirty_pages() for a chance
to set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.

On the sqrt in dirty_poll_interval():

It will serve as an initial guess when dirty pages are still in the
freerun area.

When dirty pages are floating inside the dirty control scope [freerun,
limit], a followup patch will use some refined dirty poll interval to
get the desired pause time.

   thresh-dirty (MB)    sqrt
		   1      16
		   2      22
		   4      32
		   8      45
		  16      64
		  32      90
		  64     128
		 128     181
		 256     256
		 512     362
		1024     512

The above table means: given a 1MB (or 1GB) gap, with the dd tasks
polling balance_dirty_pages() every 16 (or 512) pages, the dirty limit
won't be exceeded as long as there are fewer than 16 (or 512) concurrent
dd's.

So sqrt naturally leads to lower overhead and safely supports more
concurrent tasks on large-memory servers, which have large
(thresh - freerun) gaps.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++
 kernel/fork.c         |    3 +
 mm/page-writeback.c   |   89 ++++++++++++++++++++++------------------
 3 files changed, 60 insertions(+), 39 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-10-03 21:05:31.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-10-03 21:05:40.000000000 +0800
@@ -1525,6 +1525,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-10-03 21:05:39.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:40.000000000 +0800
@@ -54,20 +54,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -169,6 +155,8 @@ static void update_completion_period(voi
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
+
+	writeback_set_ratelimit();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
@@ -979,6 +967,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If dirty_poll_interval is too low, big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long dirty_poll_interval(unsigned long dirty,
+					 unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1112,6 +1117,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1138,7 +1146,7 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
+static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1158,31 +1166,39 @@ void balance_dirty_pages_ratelimited_nr(
 					unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	unsigned long ratelimit;
-	unsigned long *p;
+	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	ratelimit = current->nr_dirtied_pause;
+	if (bdi->dirty_exceeded)
+		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
+
+	current->nr_dirtied += nr_pages_dirtied;
 
+	preempt_disable();
 	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
+	 * This prevents one CPU from accumulating too many dirtied pages
+	 * without calling into balance_dirty_pages(), which can happen when
+	 * 1000+ tasks all start dirtying pages at exactly the same time,
+	 * hence all honouring a too-large initial task->nr_dirtied_pause.
 	 */
-	preempt_disable();
 	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	else {
+		*p += nr_pages_dirtied;
+		if (unlikely(*p >= ratelimit_pages)) {
+			*p = 0;
+			ratelimit = 0;
+		}
 	}
 	preempt_enable();
+
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1277,22 +1293,17 @@ void laptop_sync_completion(void)
  *
  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
+ * thresholds.
  */
 
 void writeback_set_ratelimit(void)
 {
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
 
 static int __cpuinit
--- linux-next.orig/kernel/fork.c	2011-10-03 21:05:31.000000000 +0800
+++ linux-next/kernel/fork.c	2011-10-03 21:05:40.000000000 +0800
@@ -1302,6 +1302,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.




* [PATCH 07/11] writeback: IO-less balance_dirty_pages()
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (5 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 06/11] writeback: per task dirty rate limit Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 08/11] writeback: limit max dirty pause time Wu Fengguang
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15816 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, we get N IO submitters from at least N different inodes at
  the same time, ending up with N different sets of IO being issued with
  potentially zero locality to each other. This results in much lower
  elevator sort/merge efficiency, and hence we seek the disk all over
  the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user-perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on its own couples the
  IO size to the wait time, which makes it hard to pick a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds, or even longer, to stop a heavily writing dd/cp/tar command
  with Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progress is very bumpy on the legacy kernel, and its
  throughput is improved by 67% by this patchset. (Together with the
  larger write chunk size, the speedup reaches 93%.)

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored a scheme that lets
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitters.

- NFS may clear a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMITs with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are not likely to be optimized away, nor are the resulting
  large (and tiny) stall times of IO-completion-based throttling.

So here is a pause-time-oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in the 1-dd case, increasing to ~100ms
in the 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold, and are then balanced
around 17.5%. Before this patch, the behavior was to simply throttle at
20% of dirtyable memory in the 1-dd case.

Since tasks will be soft throttled earlier than before, end users may
perceive a performance "slow down" if their application happens to dirty
more than 15% of dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  161 ++++++++++-------------------
 2 files changed, 56 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 21:05:40.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:43.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -986,30 +942,36 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
+	unsigned long bdi_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long freerun;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long dirty_ratelimit;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -1026,9 +988,23 @@ static void balance_dirty_pages(struct a
 		if (nr_dirty <= freerun)
 			break;
 
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		/*
+		 * bdi_thresh is not treated as a hard limiting factor like
+		 * dirty_thresh, for two reasons:
+		 * - in JBOD setup, bdi_thresh can fluctuate a lot
+		 * - in a system with HDD and USB key, the USB key may somehow
+		 *   go into state (bdi_dirty >> bdi_thresh) either because
+		 *   bdi_dirty starts high, or because bdi_thresh drops low.
+		 *   In this case we don't want to hard throttle the USB key
+		 *   dirtiers for 100 seconds until bdi_dirty drops under
+		 *   bdi_thresh. Instead the auxiliary bdi control line in
+		 *   bdi_position_ratio() will let the dirtier task progress
+		 *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+		 */
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -1040,57 +1016,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
+			bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+			bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		dirty_ratelimit = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min_t(long, pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1099,22 +1059,11 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh &&
-		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+		if (nr_dirty < dirty_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1131,8 +1080,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-10-03 21:05:31.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-10-03 21:05:43.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 08/11] writeback: limit max dirty pause time
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (6 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 07/11] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 09/11] writeback: control " Wu Fengguang
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause --]
[-- Type: text/plain, Size: 3090 bytes --]

Apply two policies to scale down the max pause time for

1) a small number of concurrent dirtiers
2) small memory systems (compared to the storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high end servers with lots of
concurrent dirtiers, where the large pause time can amortize much of the overhead.

Otherwise, a smaller pause time is desirable whenever possible, so as to
get good responsiveness and a smooth user experience. It's actually
required for good disk utilization in the case when all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   44 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 21:05:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:46.000000000 +0800
@@ -939,6 +939,43 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long bw = bdi->avg_write_bandwidth;
+	unsigned long hi = ilog2(bw);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for 20ms max pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too long
+	 * time, a small pool of dirty/writeback pages may go empty and disk go
+	 * idle.
+	 *
+	 * 8 serves as the safety ratio.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -958,6 +995,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
@@ -1035,18 +1073,20 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
-			pause = MAX_PAUSE;
+			pause = max_pause;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
 					pos_ratio >> RATELIMIT_CALC_SHIFT;
 		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
-		pause = min_t(long, pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
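The scaling above can be modeled in user space. This is a minimal sketch, not
the kernel code: HZ=1000 is an assumption, write_bw and ratelimit are taken in
pages/second, and ilog2_ul()/clamp_ul() are local stand-ins for the kernel's
ilog2() and clamp_val().

```c
#include <assert.h>

#define HZ 1000			/* assumed tick rate for this sketch */
#define MAX_PAUSE (HZ / 5)	/* 200ms, as in the patch */

static unsigned long ilog2_ul(unsigned long v)
{
	unsigned long r = 0;

	while (v >>= 1)
		r++;
	return r;
}

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * User-space model of bdi_max_pause(): write_bw and ratelimit are in
 * pages/second, bdi_dirty in pages; the result is in jiffies.
 */
static unsigned long max_pause(unsigned long write_bw,
			       unsigned long ratelimit,
			       unsigned long bdi_dirty)
{
	unsigned long hi = ilog2_ul(write_bw);
	unsigned long lo = ilog2_ul(ratelimit);
	unsigned long t = HZ / 50;	/* 20ms target for the 1-dd case */

	/* +~20ms per doubling of write_bw/ratelimit (~ nr of dirtiers) */
	if (hi > lo)
		t += (hi - lo) * (20 * HZ) / 1024;

	/* don't sleep so long that a small dirty pool drains and disk idles */
	if (bdi_dirty) {
		unsigned long drain = bdi_dirty * HZ / (8 * write_bw + 1);

		if (drain < t)
			t = drain;
	}

	return clamp_ul(t, 4, MAX_PAUSE);
}
```

With one dd (ratelimit ~= write_bw) the pause stays at the 20ms target; with
~1024 dds it scales up and saturates at the 200ms ceiling; with a tiny dirty
pool it shrinks toward the minimum so the IO queue cannot underrun.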



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 09/11] writeback: control dirty pause time
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 08/11] writeback: limit max dirty pause time Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 10/11] writeback: dirty position control - bdi reserve area Wu Fengguang
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause-adaption --]
[-- Type: text/plain, Size: 2263 bytes --]

The dirty pause time shall ultimately be controlled by adjusting
nr_dirtied_pause, since there is the relationship

	pause = pages_dirtied / task_ratelimit

Assuming

	pages_dirtied ~= nr_dirtied_pause
	task_ratelimit ~= dirty_ratelimit

We get

	nr_dirtied_pause ~= dirty_ratelimit * desired_pause

Here dirty_ratelimit is preferred over task_ratelimit because it's
more stable.

It's also important to limit possible large transient errors:

- bw is changing quickly
- pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
- pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
  separate fix, but still expect non-trivial errors)

So we end up using the above formula inside clamp_val().

The best test case for this code is to run 100 "dd bs=4M" tasks on
btrfs and check its pause time distribution.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 17:35:57.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 17:39:27.000000000 +0800
@@ -1086,6 +1086,10 @@ static void balance_dirty_pages(struct a
 		task_ratelimit = (u64)dirty_ratelimit *
 					pos_ratio >> RATELIMIT_CALC_SHIFT;
 		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		if (unlikely(pause <= 0)) {
+			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			break;
+		}
 		pause = min(pause, max_pause);
 
 pause:
@@ -1107,7 +1111,21 @@ pause:
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
-	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+	if (pause == 0) { /* in freerun area */
+		current->nr_dirtied_pause =
+				dirty_poll_interval(nr_dirty, dirty_thresh);
+	} else if (pause <= max_pause / 4 &&
+		   pages_dirtied >= current->nr_dirtied_pause) {
+		current->nr_dirtied_pause = clamp_val(
+					dirty_ratelimit * (max_pause / 2) / HZ,
+					pages_dirtied + pages_dirtied / 8,
+					pages_dirtied * 4);
+	} else if (pause >= max_pause) {
+		current->nr_dirtied_pause = 1 | clamp_val(
+					dirty_ratelimit * (max_pause / 2) / HZ,
+					pages_dirtied / 4,
+					pages_dirtied - pages_dirtied / 8);
+	}
 
 	if (writeback_in_progress(bdi))
 		return;
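The three-way nr_dirtied_pause update above can be sketched in user space as
follows. This is a minimal model under assumed units (HZ=1000, dirty_ratelimit
in pages/second); clamp_ul() is a local stand-in for the kernel's clamp_val(),
and the freerun branch is folded into "keep the budget".

```c
#include <assert.h>

#define HZ 1000		/* assumed tick rate for this sketch */

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Model of the nr_dirtied_pause update: aim for a dirty budget worth half
 * the max pause at the current dirty_ratelimit, but move there gradually
 * relative to pages_dirtied to damp transient errors.
 */
static unsigned long next_nr_dirtied_pause(unsigned long pause,
					   unsigned long max_pause,
					   unsigned long dirty_ratelimit,
					   unsigned long pages_dirtied,
					   unsigned long nr_dirtied_pause)
{
	unsigned long target = dirty_ratelimit * (max_pause / 2) / HZ;

	if (pause <= max_pause / 4 && pages_dirtied >= nr_dirtied_pause)
		/* paused too briefly: raise the budget, at most 4x a step */
		return clamp_ul(target,
				pages_dirtied + pages_dirtied / 8,
				pages_dirtied * 4);
	if (pause >= max_pause)
		/* paused too long: lower the budget, at most ~4x a step */
		return 1 | clamp_ul(target,
				    pages_dirtied / 4,
				    pages_dirtied - pages_dirtied / 8);
	return nr_dirtied_pause;	/* pause within range: keep it */
}
```

The clamp limits how far a single update can move the budget, which is what
keeps the btrfs "pages_dirtied >> nr_dirtied_pause" case from overshooting.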



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 10/11] writeback: dirty position control - bdi reserve area
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (8 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 09/11] writeback: control " Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:42 ` [PATCH 11/11] writeback: per-bdi background threshold Wu Fengguang
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: bdi-reserve-area --]
[-- Type: text/plain, Size: 1468 bytes --]

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun. Also gently increase a small bdi_thresh to avoid
it getting stuck at 0 for a lightly dirtied bdi.

It's particularly useful for JBOD and small memory systems.

It may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended, because the bdi is in
danger of IO queue underrun.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-10-03 21:05:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-03 21:05:51.000000000 +0800
@@ -599,6 +599,7 @@ static unsigned long bdi_position_ratio(
 	 */
 	if (unlikely(bdi_thresh > thresh))
 		bdi_thresh = thresh;
+	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
 	/*
 	 * scale global setpoint to bdi's:
 	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
@@ -622,6 +623,20 @@ static unsigned long bdi_position_ratio(
 	} else
 		pos_ratio /= 4;
 
+	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 * It may push the desired control point of global dirty pages higher
+	 * than setpoint.
+	 */
+	x_intercept = bdi_thresh / 2;
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
 	return pos_ratio;
 }
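The reserve-area boost above can be isolated as a small user-space model. A
minimal sketch, not the kernel function: only the new hunk is modeled, with
pos_ratio treated as a fixed-point value where (1 << RATELIMIT_CALC_SHIFT)
means 1.0, as elsewhere in the series.

```c
#include <assert.h>

#define RATELIMIT_CALC_SHIFT 10	/* pos_ratio fixed point, as in the series */

/*
 * Model of the bdi reserve area: when bdi_dirty falls below half of
 * bdi_thresh, scale pos_ratio up so tasks dirty faster and refill the pool.
 */
static unsigned long long reserve_boost(unsigned long long pos_ratio,
					unsigned long bdi_thresh,
					unsigned long bdi_dirty)
{
	unsigned long x_intercept = bdi_thresh / 2;

	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / 8) {
			/* hyperbolic boost: x_intercept / bdi_dirty */
			pos_ratio *= x_intercept;
			pos_ratio /= bdi_dirty;
		} else {
			/* nearly drained pool: flat 8x boost */
			pos_ratio *= 8;
		}
	}
	return pos_ratio;
}
```

Above the reserve area pos_ratio is untouched; inside it the boost grows
hyperbolically as bdi_dirty shrinks, and saturates at 8x near zero.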
 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 11/11] writeback: per-bdi background threshold
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (9 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 10/11] writeback: dirty position control - bdi reserve area Wu Fengguang
@ 2011-10-03 13:42 ` Wu Fengguang
  2011-10-03 13:59 ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-background-thresh.patch --]
[-- Type: text/plain, Size: 5361 bytes --]

One thing that puzzled me is that in the JBOD case, the per-disk writeout
performance is lower than in the corresponding single-disk case even
when they have comparable bdi_thresh. Tracing shows that in the single
disk case, bdi_writeback is always kept high, while in the JBOD case it
can drop low from time to time, and correspondingly bdi_reclaimable
can sometimes rush high.

The fix is to watch bdi_reclaimable and kick off background writeback as
soon as it goes high. This resembles the global background threshold,
but in a per-bdi manner. The trick is that as long as bdi_reclaimable
does not go high, bdi_writeback naturally won't go low, because
bdi_reclaimable + bdi_writeback ~= bdi_thresh.

With fewer fluctuations in the writeback pages, JBOD performance is
observed to increase noticeably in various cases.

vmstat:nr_written values before/after patch:

  3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
------------------------  ------------------------  
               125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
               110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
               127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
               206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-02 10:28:55.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-02 10:42:22.000000000 +0800
@@ -658,14 +658,21 @@ long writeback_inodes_wb(struct bdi_writ
 	return nr_pages - work.nr_pages;
 }
 
-static inline bool over_bground_thresh(void)
+static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
 	unsigned long background_thresh, dirty_thresh;
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
-	return (global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+	if (global_page_state(NR_FILE_DIRTY) +
+	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+		return true;
+
+	if (bdi_stat(bdi, BDI_RECLAIMABLE) >
+				bdi_dirty_limit(bdi, background_thresh))
+		return true;
+
+	return false;
 }
 
 /*
@@ -727,7 +734,7 @@ static long wb_writeback(struct bdi_writ
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh())
+		if (work->for_background && !over_bground_thresh(wb->bdi))
 			break;
 
 		if (work->for_kupdate) {
@@ -811,7 +818,7 @@ static unsigned long get_nr_dirty_pages(
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh()) {
+	if (over_bground_thresh(wb->bdi)) {
 
 		struct wb_writeback_work work = {
 			.nr_pages	= LONG_MAX,



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (10 preceding siblings ...)
  2011-10-03 13:42 ` [PATCH 11/11] writeback: per-bdi background threshold Wu Fengguang
@ 2011-10-03 13:59 ` Wu Fengguang
  2011-10-05  1:42   ` Wu Fengguang
  2011-10-04 19:52 ` Vivek Goyal
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-03 13:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Oct 03, 2011 at 09:42:28PM +0800, Wu, Fengguang wrote:
> Hi,
> 
> This is the minimal IO-less balance_dirty_pages() changes that are expected to
> be regression free (well, except for NFS).
> 
>         git://github.com/fengguang/linux.git dirty-throttling-v12
> 
> Tests results will be posted in a separate email.

The complete test matrix for the major filesystems will take a few more
days to finish.  As far as I can tell from the current test results,
the writeback performance mostly stays on par with the vanilla 3.1 kernel,
except for a -14% regression on average for NFS, which can be cut down
to -7% by limiting the commit size.

USB stick:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+  
------------------------  ------------------------  
                   54.39        +0.6%        54.73  3G-UKEY-HDD/xfs-10dd-4k-8p-4096M-20:10-X
                   63.72        -1.8%        62.58  3G-UKEY-HDD/xfs-1dd-4k-8p-4096M-20:10-X
                   58.53        -3.2%        56.65  3G-UKEY-HDD/xfs-2dd-4k-8p-4096M-20:10-X
                    6.31        +1.6%         6.41  UKEY-thresh=50M/xfs-1dd-4k-8p-4096M-50M:10-X
                    4.91        +0.9%         4.95  UKEY-thresh=50M/xfs-2dd-4k-8p-4096M-50M:10-X

single disk:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+
------------------------  ------------------------
                   47.59        -0.2%        47.50  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.83        +2.4%        58.18  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.81        +1.8%        55.79  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   45.89        -2.2%        44.89  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   56.68        +2.4%        58.06  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   53.33        -2.6%        51.94  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   89.22        +3.6%        92.40  thresh=1024M-1000M/xfs-10dd-1M-32p-32768M-1024M:1000M-X
                   93.01        -0.4%        92.65  thresh=1024M-1000M/xfs-1dd-1M-32p-32768M-1024M:1000M-X
                   91.19        -0.8%        90.46  thresh=1024M-1000M/xfs-2dd-1M-32p-32768M-1024M:1000M-X
                   58.23        +3.5%        60.29  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   57.53        +2.2%        58.80  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   57.18        +2.4%        58.53  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   35.97       -11.2%        31.96  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   36.55        -1.0%        36.19  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   44.94        +0.2%        45.03  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   53.25        -3.3%        51.47  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.17        +0.0%        56.19  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.11        +0.5%        58.41  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.93        +3.6%        43.44  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   46.34        +7.5%        49.83  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.67        +0.1%        52.70  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   25.28       +10.4%        27.91  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                   31.56       +60.3%        50.61  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                   43.89        -2.5%        42.81  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                   86.10       +25.7%       108.19  thresh=2048M-2000M/xfs-10dd-1M-32p-32768M-2048M:2000M-X
                   93.31        -1.7%        91.69  thresh=2048M-2000M/xfs-1dd-1M-32p-32768M-2048M:2000M-X
                   90.52        +0.2%        90.72  thresh=2048M-2000M/xfs-2dd-1M-32p-32768M-2048M:2000M-X
                   48.57        +4.6%        50.82  thresh=400M-300M/xfs-10dd-4k-8p-4096M-400M:300M-X
                   55.00        +2.4%        56.33  thresh=400M-300M/xfs-1dd-4k-8p-4096M-400M:300M-X
                   52.41        +1.6%        53.27  thresh=400M-300M/xfs-2dd-4k-8p-4096M-400M:300M-X
                   50.78        +1.8%        51.67  thresh=400M/xfs-10dd-4k-8p-4096M-400M:10-X
                   57.48        -0.2%        57.35  thresh=400M/xfs-1dd-4k-8p-4096M-400M:10-X
                   54.14        -1.4%        53.36  thresh=400M/xfs-2dd-4k-8p-4096M-400M:10-X
                   81.43       +11.0%        90.41  thresh=8G/xfs-100dd-1M-32p-32768M-8192M:10-X
                   87.37        +4.9%        91.67  thresh=8G/xfs-10dd-1M-32p-32768M-8192M:10-X
                   92.58        +1.0%        93.51  thresh=8G/xfs-1dd-1M-32p-32768M-8192M:10-X
                   89.78        +3.3%        92.76  thresh=8G/xfs-2dd-1M-32p-32768M-8192M:10-X
                   25.00       +28.7%        32.19  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   54.37        +2.7%        55.86  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   43.26       +13.2%        48.96  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 2350.25        +3.6%      2434.78  TOTAL

single disk, different dd block sizes:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+  
------------------------  ------------------------  
                   40.88        +6.8%        43.65  3G-bs=1M/xfs-100dd-1M-8p-4096M-20:10-X
                   49.31        +1.3%        49.95  3G-bs=1M/xfs-10dd-1M-8p-4096M-20:10-X
                   54.75        -0.2%        54.66  3G-bs=1M/xfs-1dd-1M-8p-4096M-20:10-X
                   39.65        -1.2%        39.18  3G-bs=1k/xfs-100dd-1k-8p-4096M-20:10-X
                   48.08        +0.3%        48.21  3G-bs=1k/xfs-10dd-1k-8p-4096M-20:10-X
                   52.76        -0.3%        52.60  3G-bs=1k/xfs-1dd-1k-8p-4096M-20:10-X
                   40.60        +7.4%        43.60  3G/xfs-100dd-4k-8p-4096M-20:10-X
                   49.56        +1.9%        50.49  3G/xfs-10dd-4k-8p-4096M-20:10-X
                   53.90        +0.1%        53.95  3G/xfs-1dd-4k-8p-4096M-20:10-X

JBOD:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+
------------------------  ------------------------
                  653.81        -1.7%       642.93  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                  660.44        -0.5%       657.40  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                  651.53        +3.8%       676.56  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                  330.97        +1.4%       335.62  JBOD-10HDD-6G/ext4-100dd-1M-16p-8192M-20:10-X
                  376.51        +0.3%       377.73  JBOD-10HDD-6G/ext4-10dd-1M-16p-8192M-20:10-X
                  392.36        -1.6%       385.96  JBOD-10HDD-6G/ext4-1dd-1M-16p-8192M-20:10-X
                  390.44        -0.7%       387.56  JBOD-10HDD-6G/ext4-2dd-1M-16p-8192M-20:10-X
                  270.08        +2.1%       275.78  JBOD-10HDD-thresh=100M/ext4-100dd-1M-16p-8192M-100M:10-X
                  325.17       +11.4%       362.32  JBOD-10HDD-thresh=100M/ext4-10dd-1M-16p-8192M-100M:10-X
                  379.30        +3.1%       391.19  JBOD-10HDD-thresh=100M/ext4-1dd-1M-16p-8192M-100M:10-X
                  351.38        +8.6%       381.60  JBOD-10HDD-thresh=100M/ext4-2dd-1M-16p-8192M-100M:10-X
                  351.03       -23.5%       268.55  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                  411.98       +19.2%       491.25  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                  502.51       +12.9%       567.37  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                  345.61        +4.3%       360.62  JBOD-10HDD-thresh=2G/ext4-100dd-1M-16p-8192M-2048M:10-X
                  383.20        +0.2%       383.78  JBOD-10HDD-thresh=2G/ext4-10dd-1M-16p-8192M-2048M:10-X
                  393.46        -0.6%       391.27  JBOD-10HDD-thresh=2G/ext4-1dd-1M-16p-8192M-2048M:10-X
                  393.40        -0.9%       389.85  JBOD-10HDD-thresh=2G/ext4-2dd-1M-16p-8192M-2048M:10-X
                  646.70        -1.2%       638.68  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                  652.26        +2.2%       666.84  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                  642.60        +7.4%       690.19  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
                  391.37        -3.7%       376.88  JBOD-10HDD-thresh=4G/ext4-10dd-1M-16p-8192M-4096M:10-X
                  395.83        -1.2%       390.90  JBOD-10HDD-thresh=4G/ext4-1dd-1M-16p-8192M-4096M:10-X
                  398.18        -1.7%       391.44  JBOD-10HDD-thresh=4G/ext4-2dd-1M-16p-8192M-4096M:10-X
                  665.94        -0.9%       659.95  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                  660.60        +0.0%       660.73  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                  655.92        +2.1%       669.58  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                  342.39        +1.4%       347.02  JBOD-10HDD-thresh=800M/ext4-100dd-1M-16p-8192M-800M:10-X
                  367.30        +1.0%       371.03  JBOD-10HDD-thresh=800M/ext4-10dd-1M-16p-8192M-800M:10-X
                  384.76        +0.4%       386.29  JBOD-10HDD-thresh=800M/ext4-1dd-1M-16p-8192M-800M:10-X
                  378.61        +2.4%       387.56  JBOD-10HDD-thresh=800M/ext4-2dd-1M-16p-8192M-800M:10-X
                  556.88        -1.2%       550.21  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                  646.96        +2.7%       664.74  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                  619.52       +13.2%       701.36  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
                  209.76        +5.8%       221.88  JBOD-2HDD-6G/xfs-100dd-1M-24p-16384M-20:10-X
                  222.62        +2.3%       227.69  JBOD-2HDD-6G/xfs-10dd-1M-24p-16384M-20:10-X
                  234.09        -1.5%       230.62  JBOD-2HDD-6G/xfs-1dd-1M-24p-16384M-20:10-X
                  146.22       -15.8%       123.06  JBOD-2HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                  204.93        +0.3%       205.48  JBOD-2HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                  205.06        +2.7%       210.52  JBOD-2HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                  120.58       -76.6%        28.19  JBOD-2HDD-thresh=10M/xfs-100dd-1M-24p-16384M-10M:10-X
                   73.11       +53.5%       112.25  JBOD-2HDD-thresh=10M/xfs-10dd-1M-24p-16384M-10M:10-X
                   98.99       +80.2%       178.38  JBOD-2HDD-thresh=10M/xfs-1dd-1M-24p-16384M-10M:10-X
                  340.86        -1.3%       336.28  JBOD-4HDD-6G/xfs-100dd-1M-24p-16384M-20:10-X
                  369.65        +4.2%       385.01  JBOD-4HDD-6G/xfs-10dd-1M-24p-16384M-20:10-X
                  424.24        -3.4%       410.01  JBOD-4HDD-6G/xfs-1dd-1M-24p-16384M-20:10-X
                  279.28       -19.6%       224.53  JBOD-4HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                  335.48       +11.6%       374.31  JBOD-4HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                  353.58        +9.0%       385.41  JBOD-4HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                   34.31        +0.3%        34.42  JBOD-MMAP-RANDWRITE-4K/ext4-fio_mmap_randwrite_4k-4k-16p-8192M-20:10-X
                19621.76        +1.8%     19968.81  TOTAL

software RAID0:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+  
------------------------  ------------------------  
                  562.90        -2.3%       549.79  RAID0-10HDD-16G/xfs-1000dd-1M-24p-16384M-20:10-X
                  662.22        -0.5%       659.09  RAID0-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                  645.99        +0.7%       650.40  RAID0-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                 1871.12        -0.6%      1859.27  TOTAL

NFS:

      3.1.0-rc4-vanilla+        3.1.0-rc8-ioless6+  
------------------------  ------------------------  
                   20.89        +7.4%        22.43  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   39.43       -28.5%        28.21  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   26.60        +9.8%        29.21  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   12.70       +11.2%        14.12  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   27.41        +7.4%        29.44  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                   26.52       -65.7%         9.09  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   40.70       -36.9%        25.68  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   45.28        -9.3%        41.06  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   35.74        +9.5%        39.13  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                    2.89       +15.8%         3.35  NFS-thresh=1M/nfs-10dd-1M-32p-32768M-1M:10-X
                    6.69       -18.7%         5.44  NFS-thresh=1M/nfs-1dd-1M-32p-32768M-1M:10-X
                    7.16       -57.5%         3.04  NFS-thresh=1M/nfs-2dd-1M-32p-32768M-1M:10-X
                  292.02       -14.3%       250.21  TOTAL

The NFS smooth patch may cut the regressions by half:

      3.1.0-rc8-ioless6+     3.1.0-rc4-nfs-smooth+
------------------------  ------------------------
                   22.43       +39.5%        31.30  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   28.21        -4.7%        26.87  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   29.21       +17.3%        34.28  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   14.12        -6.3%        13.23  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   29.44       -57.6%        12.48  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    9.09       +41.0%        12.81  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   25.68       +63.6%        42.01  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   41.06        -7.5%        37.97  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   39.13       +13.0%        44.21  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                    3.35        +2.2%         3.42  NFS-thresh=1M/nfs-10dd-1M-32p-32768M-1M:10-X
                    5.44       +37.0%         7.45  NFS-thresh=1M/nfs-1dd-1M-32p-32768M-1M:10-X
                    3.04       +21.9%         3.70  NFS-thresh=1M/nfs-2dd-1M-32p-32768M-1M:10-X
                  250.21        +7.8%       269.74  TOTAL

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (11 preceding siblings ...)
  2011-10-03 13:59 ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
@ 2011-10-04 19:52 ` Vivek Goyal
  2011-10-05 13:56   ` Wu Fengguang
  2011-10-05 15:16   ` Andi Kleen
  2011-10-10 12:14 ` Peter Zijlstra
  2011-10-20  3:39 ` Wu Fengguang
  14 siblings, 2 replies; 29+ messages in thread
From: Vivek Goyal @ 2011-10-04 19:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, Oct 03, 2011 at 09:42:28PM +0800, Wu Fengguang wrote:
> Hi,
> 
> This is the minimal IO-less balance_dirty_pages() changes that are expected to
> be regression free (well, except for NFS).
> 
>         git://github.com/fengguang/linux.git dirty-throttling-v12
> 
> Test results will be posted in a separate email.

Looks like we are solving two problems.

- IO-less balance_dirty_pages()
- Throttling based on a ratelimit instead of on the number of dirty pages.

The second piece is the one with the complicated calculations of the
global/bdi rates and the logic for stabilizing the rates, etc.

IIUC, the second piece is primarily needed for better latencies for writers.

Would it make sense to break this work down into two patch series? First
push the IO-less balance_dirty_pages(), then all the complicated
ratelimit pieces.

The ratelimit allowed you to come up with a sleep time for the process.
Without that, I think you will have to fall back to what Jan Kara had
done: calculation based on the number of pages.

Thanks
Vivek


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-03 13:59 ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
@ 2011-10-05  1:42   ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-05  1:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> As far as I can tell from the current test results,
> the writeback performance mostly stays on par with vanilla 3.1 kernel
> except for -14% regression on average for NFS, which can be cut down
> to -7% by limiting the commit size.

I find that the overall NFS throughput can be improved by 42% by
adding the NFS writeback wait queue and limiting the commit size.

      3.1.0-rc8-ioless6+  3.1.0-rc8-nfs-wq-smooth+  
------------------------  ------------------------  
                   22.43       +79.2%        40.20  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   28.21       +11.9%        31.58  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   29.21       +54.0%        44.98  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   14.12       +31.0%        18.50  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   29.44        +2.1%        30.06  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    9.09      +231.0%        30.07  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   25.68       +88.6%        48.43  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   41.06       +14.9%        47.16  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   39.13       +26.7%        49.56  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  238.38       +42.9%       340.54  TOTAL

The theoretical explanation could be that one smooths out the NFS
write requests and the other smooths out the NFS commits, hence
yielding a better-utilized network/disk pipeline.

As a result, the -14% regression can be turned into a 23% speedup
compared to the vanilla kernel:

      3.1.0-rc4-vanilla+  3.1.0-rc8-nfs-wq-smooth+
------------------------  ------------------------
                   20.89       +92.5%        40.20  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   39.43       -19.9%        31.58  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   26.60       +69.1%        44.98  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   12.70       +45.7%        18.50  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   27.41        +9.7%        30.06  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                   26.52       +13.4%        30.07  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   40.70       +19.0%        48.43  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   45.28        +4.2%        47.16  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   35.74       +38.7%        49.56  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  275.28       +23.7%       340.54  TOTAL


The tests don't cover disk arrays on the server side; however, they do
test various combinations of memory:bandwidth ratios.

Thanks,
Fengguang


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-04 19:52 ` Vivek Goyal
@ 2011-10-05 13:56   ` Wu Fengguang
  2011-10-05 15:16   ` Andi Kleen
  1 sibling, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-05 13:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Oct 05, 2011 at 03:52:06AM +0800, Vivek Goyal wrote:
> On Mon, Oct 03, 2011 at 09:42:28PM +0800, Wu Fengguang wrote:
> > Hi,
> > 
> > This is the minimal IO-less balance_dirty_pages() changes that are expected to
> > be regression free (well, except for NFS).
> > 
> >         git://github.com/fengguang/linux.git dirty-throttling-v12
> > 
> > Test results will be posted in a separate email.
> 
> Looks like we are solving two problems.
> 
> - IO-less balance_dirty_pages()
> - Throttling based on a ratelimit instead of on the number of dirty pages.
> 
> The second piece is the one with the complicated calculations of the
> global/bdi rates and the logic for stabilizing the rates, etc.
> 
> IIUC, the second piece is primarily needed for better latencies for writers.

Well, yes. The bdi->dirty_ratelimit estimation turns out to be the
most confusing part of the patchset... Other than the complexities,
the algorithm does work pretty well in the tests (except for the
small-memory cases, where its estimation accuracy no longer matters).

Note that the bdi->dirty_ratelimit estimation, even when it goes
wrong, is very unlikely to cause large regressions. The known
regressions mostly originate from the IO-less nature itself.

> Would it make sense to break this work down into two patch series? First
> push the IO-less balance_dirty_pages(), then all the complicated
> ratelimit pieces.
> 
> The ratelimit allowed you to come up with a sleep time for the process.
> Without that, I think you will have to fall back to what Jan Kara had
> done: calculation based on the number of pages.

If we drop all the smoothness considerations, the minimal
implementation would be close to this patch:

        [PATCH 05/35] writeback: IO-less balance_dirty_pages()
        http://www.spinics.net/lists/linux-mm/msg12880.html

However, experience shows it may lead to much worse latencies than the
vanilla kernel in JBOD cases. This is because the vanilla kernel has
the option to break out of the loop once enough pages have been
written, whereas the IO-less balance_dirty_pages() will just wait
until the dirty pages drop below the (rushed high) bdi threshold,
which could take a long time.

Another point is that the IO-less balance_dirty_pages() is basically

        on every N pages dirtied, sleep for M jiffies

In the current patchset, we get the desired N with the formula

        N = bdi->dirty_ratelimit / desired_M

When dirty_ratelimit is not available, it would be a problem to
estimate an adequate N that works well for various workloads.

And to avoid regressions, patches 8, 9, 10 and 11 (maybe in updated
form) will still be necessary, along with a complete rerun of all the
test cases and fixes for any new regressions that show up.

Overall it may cost too much (if it is possible at all, considering
the two problems listed above) to try out the above steps. The main
intention is "whether we can introduce the dirty_ratelimit
complexities later". Considering that the complexity itself is
unlikely to cause problems other than a loss of smoothness, it looks
more beneficial to test the ready-made code earlier in production
environments than to spend lots of effort stripping it out and testing
new code, only to add it back in some future release.

Thanks,
Fengguang


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-04 19:52 ` Vivek Goyal
  2011-10-05 13:56   ` Wu Fengguang
@ 2011-10-05 15:16   ` Andi Kleen
  1 sibling, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2011-10-05 15:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Andrea Righi, linux-mm, LKML

Vivek Goyal <vgoyal@redhat.com> writes:
>
> Would it make sense to break this work down into two patch series? First
> push the IO-less balance_dirty_pages(), then all the complicated
> ratelimit pieces.

I would be wary of too much refactoring of well-tested patchkits.
I've seen too many cases where this can add nasty and subtle bugs,
given that our unit test coverage is usually relatively poor.

For example, the infamous "absolute path names became twice as slow"
bug was very likely introduced in such a refactoring of a large VFS
patchkit.

While it's generally good to make things easier for reviewers, too
much of a good thing can be quite bad.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (12 preceding siblings ...)
  2011-10-04 19:52 ` Vivek Goyal
@ 2011-10-10 12:14 ` Peter Zijlstra
  2011-10-10 13:07   ` Wu Fengguang
  2011-10-20  3:39 ` Wu Fengguang
  14 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2011-10-10 12:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-10-03 at 21:42 +0800, Wu Fengguang wrote:
> This is the minimal IO-less balance_dirty_pages() changes that are expected to
> be regression free (well, except for NFS).

I can't seem to get around to reviewing these patches in detail, but
FWIW I'm fine with pushing forward with this set (plus a possible NFS fix).

I don't see a reason to strip it down even further.

So I guess that's:

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>





* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-10 12:14 ` Peter Zijlstra
@ 2011-10-10 13:07   ` Wu Fengguang
  2011-10-10 13:10     ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
  2011-10-10 14:28     ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
  0 siblings, 2 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-10 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Trond Myklebust, linux-nfs

On Mon, Oct 10, 2011 at 08:14:06PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 21:42 +0800, Wu Fengguang wrote:
> > This is the minimal IO-less balance_dirty_pages() changes that are expected to
> > be regression free (well, except for NFS).
> 
> I can't seem to get around to reviewing these patches in detail, but
> FWIW I'm fine with pushing forward with this set (plus a possible NFS fix).
> 
> I don't see a reason to strip it down even further.
> 
> So I guess that's:
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Thanks :-) In fact you've already reviewed the major parts of the
patchset in great detail and helped simplify parts of the algorithm,
which I appreciate a lot.

As for the NFS performance, the dd tests show that adding a writeback
wait queue to limit the number of NFS PG_writeback pages (patches
will follow) gains 48% throughput by itself:

      3.1.0-rc8-ioless6+         3.1.0-rc8-nfs-wq+  
------------------------  ------------------------  
                   22.43       +81.8%        40.77  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   28.21       +52.6%        43.07  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   29.21       +55.4%        45.39  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   14.12       +40.4%        19.83  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   29.44       +11.4%        32.81  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    9.09      +240.9%        30.97  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   25.68       +84.6%        47.42  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   41.06        +7.6%        44.20  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   39.13       +25.9%        49.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  238.38       +48.4%       353.72  TOTAL

This results in a 28% overall improvement over the vanilla kernel:

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+  
------------------------  ------------------------  
                   20.89       +95.2%        40.77  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   39.43        +9.2%        43.07  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   26.60       +70.6%        45.39  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   12.70       +56.1%        19.83  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   27.41       +19.7%        32.81  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                   26.52       +16.8%        30.97  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   40.70       +16.5%        47.42  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   45.28        -2.4%        44.20  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   35.74       +37.8%        49.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  275.28       +28.5%       353.72  TOTAL

As for the main concern, the NFS commits, the wait queue patch
increases the (nr_commits / bytes_written) ratio by +74% for the
thresh=1G,10dd case and +55% for the thresh=100M,10dd case, while the
change is mostly negligible in the other 1dd and 2dd cases, which
looks acceptable.

The other noticeable change from the wait queue is that the RTT per
write is reduced by 1-2 orders of magnitude in many of the cases
below (from dozens of seconds to hundreds of milliseconds).

Thanks,
Fengguang
---

PS. mountstats numbers

thresh=1GB
==========

1dd

vanilla        WRITE: 33108 33108 0 13794766688 4502688 89550800 1826162 91643336
ioless6        WRITE: 104355 104355 0 12824990824 14192280 1677501539 13497260 1691407074
nfs-wq         WRITE: 58632 58632 0 13635750848 7973952 148662395 4735943 153535047

vanilla       COMMIT: 29 29 0 3248 3712 45210 191022 236235
ioless6       COMMIT: 26 26 0 2912 3328 32875 196848 229725
nfs-wq        COMMIT: 35 35 0 3920 4480 1156 223393 224550

2dd

vanilla        WRITE: 28681 28681 0 11507024952 3900616 178242698 5849890 184288501
ioless6        WRITE: 151075 151075 0 12192866408 20546200 3195004617 5748708 3200969292
nfs-wq         WRITE: 89925 89925 0 15450966104 12229800 212096905 3443883 215849660

vanilla       COMMIT: 43 43 0 4816 5504 45252 349816 396792
ioless6       COMMIT: 52 52 0 5824 6656 40798 376099 417068
nfs-wq        COMMIT: 66 66 0 7392 8448 10854 490021 502373

10dd

vanilla        WRITE: 47281 47281 0 14044390136 6430216 1378503679 11994453 1390582846
ioless6        WRITE: 35972 35972 0 7959317984 4892192 1205239506 7412186 1212670083
nfs-wq         WRITE: 49625 49625 0 14819167672 6749000 10704223 4135391 14876589

vanilla       COMMIT: 235 235 0 26320 30080 328532 1097793 1426737
ioless6       COMMIT: 128 128 0 14336 16384 73611 388716 462470
nfs-wq        COMMIT: 431 432 0 48384 55168 217056 1775499 1993006


thresh=100MB
============

1dd

vanilla        WRITE: 28858 28858 0 12427843376 3924688 6384263 2308574 8722669
nfs-wq         WRITE: 206620 206620 0 13104059680 28100320 90597897 10245879 101016004

vanilla       COMMIT: 250 250 0 28000 32000 27030 229750 256786
nfs-wq        COMMIT: 267 267 0 29904 34176 4672 247504 252184

2dd

vanilla        WRITE: 32593 32593 0 8382655992 4432648 193667999 3611697 197302564
nfs-wq         WRITE: 98662 98662 0 14025467856 13418032 183280630 5381343 188715890

vanilla       COMMIT: 272 272 0 30464 34816 24445 295949 320576
nfs-wq        COMMIT: 584 584 0 65408 74752 1318 483049 484442

10dd

vanilla        WRITE: 32294 32294 0 6651515344 4391984 104926130 8666874 113596871
nfs-wq         WRITE: 27571 27571 0 12711521256 3749656 6129491 2248486 8385102

vanilla       COMMIT: 825 825 0 92400 105600 82135 739763 822179
nfs-wq        COMMIT: 2449 2449 0 274288 313472 6091 2057767 2064555


* [RFC][PATCH 1/2] nfs: writeback pages wait queue
  2011-10-10 13:07   ` Wu Fengguang
@ 2011-10-10 13:10     ` Wu Fengguang
  2011-10-10 13:11       ` [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold Wu Fengguang
  2011-10-18  8:51       ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
  2011-10-10 14:28     ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
  1 sibling, 2 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-10 13:10 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[Known bug: this patch will block sync(1) in schedule() if the dirty
threshold is set as low as 1MB.]

The generic writeback routines are departing from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request
queues.

Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages can grow out of control, exhausting all PG_dirty pages.

CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c           |    2 
 fs/nfs/write.c            |   89 +++++++++++++++++++++++++++++++-----
 include/linux/nfs_fs_sb.h |    1 
 3 files changed, 81 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2011-09-29 20:23:44.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-10-05 10:43:28.000000000 +0800
@@ -189,11 +189,64 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
 
+#define NFS_WAIT_PAGES	(1024L >> (PAGE_SHIFT - 10))
 int nfs_congestion_kb;
 
-#define NFS_CONGESTION_ON_THRESH 	(nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH	\
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_congested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state))
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state))
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -204,11 +257,8 @@ static int nfs_set_page_writeback(struct
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		page_cache_get(page);
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -220,8 +270,10 @@ static void nfs_end_page_writeback(struc
 
 	end_page_writeback(page);
 	page_cache_release(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -322,19 +374,34 @@ static int nfs_writepage_locked(struct p
 
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+				   struct writeback_control *wbc, void *data)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
+
+	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
--- linux-next.orig/include/linux/nfs_fs_sb.h	2011-09-02 09:02:07.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h	2011-10-04 20:26:03.000000000 +0800
@@ -102,6 +102,7 @@ struct nfs_server {
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
+	wait_queue_head_t	writeback_wait[2];
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux-next.orig/fs/nfs/client.c	2011-08-22 13:59:52.000000000 +0800
+++ linux-next/fs/nfs/client.c	2011-10-04 20:26:03.000000000 +0800
@@ -1066,6 +1066,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->layouts);
 
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {


* [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold
  2011-10-10 13:10     ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
@ 2011-10-10 13:11       ` Wu Fengguang
  2011-10-18  8:53         ` Wu Fengguang
  2011-10-18  8:51       ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-10 13:11 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

nfs_congestion_kb controls the max allowed writeback and in-commit
pages. It's not reasonable for them to outnumber the dirty and
to-commit pages, so each of them should take no more than 1/4 of the
dirty threshold.

Considering that nfs_init_writepagecache() is called at boot time,
when dirty_thresh is much higher than the real dirty limit seen after
lots of user space memory has been consumed, use 1/8 instead.

We might update nfs_congestion_kb when the global dirty limit is
changed at runtime, but for now keep it simple.

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c      |   52 ++++++++++++++++++++++++++++--------------
 mm/page-writeback.c |    6 ++++
 2 files changed, 41 insertions(+), 17 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2011-10-09 21:36:22.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-10-10 21:05:07.000000000 +0800
@@ -1775,61 +1775,79 @@ int nfs_migrate_page(struct address_spac
 	set_page_private(newpage, (unsigned long)req);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
 	spin_unlock(&mapping->host->i_lock);
 	page_cache_release(page);
 out_unlock:
 	nfs_clear_page_tag_locked(req);
 out:
 	return ret;
 }
 #endif
 
-int __init nfs_init_writepagecache(void)
+void nfs_update_congestion_thresh(void)
 {
-	nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
-					     sizeof(struct nfs_write_data),
-					     0, SLAB_HWCACHE_ALIGN,
-					     NULL);
-	if (nfs_wdata_cachep == NULL)
-		return -ENOMEM;
-
-	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
-						     nfs_wdata_cachep);
-	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
-
-	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
-						      nfs_wdata_cachep);
-	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
 
 	/*
 	 * NFS congestion size, scale with available memory.
 	 *
 	 *  64MB:    8192k
 	 * 128MB:   11585k
 	 * 256MB:   16384k
 	 * 512MB:   23170k
 	 *   1GB:   32768k
 	 *   2GB:   46340k
 	 *   4GB:   65536k
 	 *   8GB:   92681k
 	 *  16GB:  131072k
 	 *
 	 * This allows larger machines to have larger/more transfers.
 	 * Limit the default to 256M
 	 */
 	nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
 	if (nfs_congestion_kb > 256*1024)
 		nfs_congestion_kb = 256*1024;
 
+	/*
+	 * Limit to 1/8 of the dirty threshold, so that writeback+in_commit
+	 * pages won't outnumber the dirty+to_commit pages.
+	 */
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	dirty_thresh <<= PAGE_SHIFT - 10;
+
+	if (nfs_congestion_kb > dirty_thresh / 8)
+		nfs_congestion_kb = dirty_thresh / 8;
+}
+
+int __init nfs_init_writepagecache(void)
+{
+	nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
+					     sizeof(struct nfs_write_data),
+					     0, SLAB_HWCACHE_ALIGN,
+					     NULL);
+	if (nfs_wdata_cachep == NULL)
+		return -ENOMEM;
+
+	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
+						     nfs_wdata_cachep);
+	if (nfs_wdata_mempool == NULL)
+		return -ENOMEM;
+
+	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
+						      nfs_wdata_cachep);
+	if (nfs_commit_mempool == NULL)
+		return -ENOMEM;
+
+	nfs_update_congestion_thresh();
+
 	return 0;
 }
 
 void nfs_destroy_writepagecache(void)
 {
 	mempool_destroy(nfs_commit_mempool);
 	mempool_destroy(nfs_wdata_mempool);
 	kmem_cache_destroy(nfs_wdata_cachep);
 }
 
--- linux-next.orig/mm/page-writeback.c	2011-10-09 21:36:06.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-10-10 21:05:07.000000000 +0800
@@ -138,34 +138,39 @@ static struct prop_descriptor vm_dirties
 static int calc_period_shift(void)
 {
 	unsigned long dirty_total;
 
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
+void __weak nfs_update_congestion_thresh(void)
+{
+}
+
 /*
  * update the period when the dirty threshold changes.
  */
 static void update_completion_period(void)
 {
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
 
 	writeback_set_ratelimit();
+	nfs_update_congestion_thresh();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
 {
 	int ret;
 
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (ret == 0 && write)
 		dirty_background_bytes = 0;
 	return ret;
@@ -438,24 +443,25 @@ unsigned long bdi_dirty_limit(struct bac
 	bdi_writeout_fraction(bdi, &numerator, &denominator);
 
 	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
 	bdi_dirty *= numerator;
 	do_div(bdi_dirty, denominator);
 
 	bdi_dirty += (dirty * bdi->min_ratio) / 100;
 	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 		bdi_dirty = dirty * bdi->max_ratio / 100;
 
 	return bdi_dirty;
 }
+EXPORT_SYMBOL_GPL(global_dirty_limits);
 
 /*
  * Dirty position control.
  *
  * (o) global/bdi setpoints
  *
  * We want the dirty pages be balanced around the global/bdi setpoints.
  * When the number of dirty pages is higher/lower than the setpoint, the
  * dirty position control ratio (and hence task dirty ratelimit) will be
  * decreased/increased to bring the dirty pages back to the setpoint.
  *
  *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-10 13:07   ` Wu Fengguang
  2011-10-10 13:10     ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
@ 2011-10-10 14:28     ` Wu Fengguang
  2011-10-17  3:03       ` Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-10 14:28 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

Hi Trond,

> As for the NFS performance, the dd tests show that adding a writeback
> wait queue to limit the number of NFS PG_writeback pages (patches
> will follow) gains 48% throughput by itself:
> 
>       3.1.0-rc8-ioless6+         3.1.0-rc8-nfs-wq+  
> ------------------------  ------------------------  
>                    22.43       +81.8%        40.77  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
>                    28.21       +52.6%        43.07  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
>                    29.21       +55.4%        45.39  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
>                    14.12       +40.4%        19.83  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
>                    29.44       +11.4%        32.81  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
>                     9.09      +240.9%        30.97  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
>                    25.68       +84.6%        47.42  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
>                    41.06        +7.6%        44.20  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
>                    39.13       +25.9%        49.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
>                   238.38       +48.4%       353.72  TOTAL
> 
> Which will result in 28% overall improvements over the vanilla kernel:
> 
>       3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+  
> ------------------------  ------------------------  
>                    20.89       +95.2%        40.77  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
>                    39.43        +9.2%        43.07  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
>                    26.60       +70.6%        45.39  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
>                    12.70       +56.1%        19.83  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
>                    27.41       +19.7%        32.81  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
>                    26.52       +16.8%        30.97  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
>                    40.70       +16.5%        47.42  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
>                    45.28        -2.4%        44.20  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
>                    35.74       +37.8%        49.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
>                   275.28       +28.5%       353.72  TOTAL
> 
> As for the NFS commits of most concern, the wait queue patch increases
> the (nr_commits / bytes_written) ratio by +74% for the thresh=1G,10dd
> case, +55% for the thresh=100M,10dd case, and is mostly negligible in
> the other 1dd, 2dd cases, which looks acceptable.
> 
> The other noticeable change with the wait queue is that the RTT time per

Sorry, it's not RTT but mainly the local queue time of the WRITE RPCs.

> write is reduced by one to two orders of magnitude in many of the
> cases below (from dozens of seconds to hundreds of milliseconds).

I also measured the stddev of the network bandwidth, and found
smoother network transfers in general with the wait queue, which is
expected.

thresh=1G
        vanilla       ioless6       nfs-wq
1dd     83088173.728  53468627.578  53627922.011
2dd     52398918.208  43733074.167  53531381.177
10dd    67792638.857  44734947.283  39681731.234

However, the major difference should still be that the writeback wait
queue significantly reduces the local queue time for the WRITE RPCs.

The wait queue patch looks reasonable in that it keeps pages in the
PG_dirty state rather than prematurely moving them to PG_writeback
only to queue them up for dozens of seconds before transmission.

It should be safe because that is exactly the old, proven behavior
before the per-bdi writeback patches introduced in 2.6.32. The 2nd
patch, on proportional nfs_congestion_kb, is a new change, though.

Thanks,
Fengguang


* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-10 14:28     ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
@ 2011-10-17  3:03       ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-17  3:03 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 13557 bytes --]

Hi Trond,

I enhanced the script to compare the write_bw as well as the NFS
write/commit stats.

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
(MB/s)            275.28       +28.5%       353.72  TOTAL write_bw

                 5649.00      +192.3%     16510.00  TOTAL nfs_nr_commits (*)
               261987.00      +205.1%    799451.00  TOTAL nfs_nr_writes  (*)

(MB)              866.52       -18.1%       709.85  TOTAL nfs_commit_size (**)
                    2.94       -44.8%         1.62  TOTAL nfs_write_size

(ms)            47814.05       -84.0%      7631.57  TOTAL nfs_write_queue_time
                 1405.05       -53.6%       652.59  TOTAL nfs_write_rtt_time
                49237.94       -83.2%      8292.74  TOTAL nfs_write_execute_time

                 4320.98       -83.2%       726.27  TOTAL nfs_commit_queue_time
                22943.13        -8.6%     20963.46  TOTAL nfs_commit_rtt_time
                27307.42       -20.5%     21714.12  TOTAL nfs_commit_execute_time

(*) The roughly 3x increases in nfs_nr_writes and nfs_nr_commits
    should be taken with a grain of salt, because the total written
    bytes increased at the same time.

(**) The TOTAL nfs_commit_size mainly reflects the thresh=1G cases,
     because the numbers in the 10M/100M cases are very small
     compared with the 1G cases (as shown in the case-by-case
     values below). Ditto for the *_time values. However, the
     thresh=1G cases should be closest to the typical NFS client
     setup, so the values are still mostly representative.

Below are the detailed case-by-case views. The attached script shows
exactly how the numbers are calculated from mountstats.

Thanks,
Fengguang
---

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                   20.89       +95.2%        40.77  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   39.43        +9.2%        43.07  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   26.60       +70.6%        45.39  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                   12.70       +56.1%        19.83  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   27.41       +19.7%        32.81  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                   26.52       +16.8%        30.97  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   40.70       +16.5%        47.42  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   45.28        -2.4%        44.20  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   35.74       +37.8%        49.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  275.28       +28.5%       353.72  TOTAL write_bw

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                  825.00      +196.8%      2449.00  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                  250.00        +6.8%       267.00  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                  272.00      +114.7%       584.00  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                 1477.00      +350.8%      6658.00  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                  997.00      +115.8%      2152.00  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                 1521.00      +154.3%      3868.00  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                  235.00       +83.4%       431.00  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   29.00       +20.7%        35.00  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                   43.00       +53.5%        66.00  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                 5649.00      +192.3%     16510.00  TOTAL nfs_nr_commits

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                32294.00       -14.6%     27571.00  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                28858.00      +616.0%    206620.00  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                32593.00      +202.7%     98662.00  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                18937.00      +111.7%     40085.00  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                18762.00      +660.5%    142691.00  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                21473.00      +298.8%     85640.00  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                47281.00        +5.0%     49625.00  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                33108.00       +77.1%     58632.00  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                28681.00      +213.5%     89925.00  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
               261987.00      +205.1%    799451.00  TOTAL nfs_nr_writes

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                    7.69       -35.6%         4.95  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   47.41        -1.3%        46.81  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   29.39       -22.1%        22.90  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                    2.73       -68.3%         0.87  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                    8.24       -46.7%         4.39  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    5.21       -55.1%         2.34  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                   56.99       -42.5%        32.79  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                  453.65       -18.1%       371.54  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                  255.21       -12.5%       223.26  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                  866.52       -18.1%       709.85  TOTAL nfs_commit_size

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                    0.20      +123.8%         0.44  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                    0.41       -85.3%         0.06  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                    0.25       -44.7%         0.14  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                    0.21       -32.4%         0.14  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                    0.44       -84.9%         0.07  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    0.37       -71.4%         0.11  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                    0.28        +0.5%         0.28  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                    0.40       -44.2%         0.22  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                    0.38       -57.2%         0.16  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                    2.94       -44.8%         1.62  TOTAL nfs_write_size

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                 3249.09       -93.2%       222.32  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                  221.23       +98.2%       438.48  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                 5942.01       -68.7%      1857.66  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                  285.75       -99.9%         0.38  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                    6.21       -95.4%         0.28  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                   34.73       -92.4%         2.63  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                29155.55       -99.3%       215.70  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                 2704.81        -6.3%      2535.52  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                 6214.66       -62.0%      2358.60  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                47814.05       -84.0%      7631.57  TOTAL nfs_write_queue_time

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                  268.37       -69.6%        81.55  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                   80.00       -38.0%        49.59  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                  110.81       -50.8%        54.54  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                  295.72       -41.7%       172.52  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   36.64       -31.6%        25.05  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                  100.70       -33.5%        66.93  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                  253.68       -67.2%        83.33  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                   55.16       +46.4%        80.77  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                  203.96       -81.2%        38.30  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                 1405.05       -53.6%       652.59  TOTAL nfs_write_rtt_time

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                 3517.58       -91.4%       304.13  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                  302.26       +61.7%       488.90  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                 6053.53       -68.4%      1912.75  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                  581.52       -70.3%       173.00  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                   42.99       -40.7%        25.47  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                  135.56       -48.5%        69.75  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                29411.03       -99.0%       299.78  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                 2768.01        -5.4%      2618.62  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                 6425.46       -62.6%      2400.33  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                49237.94       -83.2%      8292.74  TOTAL nfs_write_execute_time

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                   99.56       -97.5%         2.49  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                  108.12       -83.8%        17.50  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                   89.87       -97.5%         2.26  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                    9.00       -90.5%         0.85  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                    2.41       -58.2%         1.01  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                    2.68       -59.9%         1.07  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                 1398.01       -64.0%       503.61  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                 1558.97       -97.9%        33.03  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                 1052.37       -84.4%       164.45  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                 4320.98       -83.2%       726.27  TOTAL nfs_commit_queue_time

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                  896.68        -6.3%       840.25  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                  919.00        +0.9%       926.98  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                 1088.05       -24.0%       827.14  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                  266.54       -22.7%       206.09  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                  191.28       -41.3%       112.32  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                  187.90       -34.0%       123.98  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                 4671.46       -11.8%      4119.49  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                 6586.97        -3.1%      6382.66  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                 8135.26        -8.7%      7424.56  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                22943.13        -8.6%     20963.46  TOTAL nfs_commit_rtt_time

      3.1.0-rc4-vanilla+         3.1.0-rc8-nfs-wq+
------------------------  ------------------------
                  996.58       -15.4%       843.02  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
                 1027.14        -8.0%       944.51  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
                 1178.59       -29.6%       829.52  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X
                  275.75       -24.9%       207.03  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
                  193.71       -41.5%       113.35  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
                  190.67       -34.4%       125.12  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X
                 6071.22       -23.8%      4624.14  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
                 8146.03       -21.2%      6415.71  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
                 9227.72       -17.5%      7611.71  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X
                27307.42       -20.5%     21714.12  TOTAL nfs_commit_execute_time


[-- Attachment #2: compare.rb --]
[-- Type: application/x-ruby, Size: 6250 bytes --]


* Re: [RFC][PATCH 1/2] nfs: writeback pages wait queue
  2011-10-10 13:10     ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
  2011-10-10 13:11       ` [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold Wu Fengguang
@ 2011-10-18  8:51       ` Wu Fengguang
  2011-10-20  3:59         ` Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-18  8:51 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, Oct 10, 2011 at 09:10:51PM +0800, Wu Fengguang wrote:
> [known bug: this patch will block sync(1) in schedule() if the dirty
> threshold is set as low as 1MB.]

The root cause of the deadlock is found to be that the flusher
generated enough PG_writeback pages to hit the throttle limit, and got
blocked just before it was able to assemble one complete NFS WRITE
RPC. So the PG_writeback pages never managed to reach the NFS server!

Feng kindly offered a fix that converts the per-page throttling to the
more coarse-grained per-write_pages throttling, which is found to
further increase both the performance and the commit size. Bingo!

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  354.65       +45.4%       515.48  TOTAL write_bw
                10498.00       +91.7%     20120.00  TOTAL nfs_nr_commits
               233013.00       +99.9%    465751.00  TOTAL nfs_nr_writes
                  895.47        +3.1%       923.62  TOTAL nfs_commit_size
                    5.71       -14.5%         4.88  TOTAL nfs_write_size
               108269.33       -84.3%     17003.69  TOTAL nfs_write_queue_time
                 1836.03       -34.4%      1204.27  TOTAL nfs_write_rtt_time
               110144.96       -83.5%     18220.96  TOTAL nfs_write_execute_time
                 2902.62       -88.6%       332.20  TOTAL nfs_commit_queue_time
                16282.75       -23.3%     12490.87  TOTAL nfs_commit_rtt_time
                19234.16       -33.3%     12833.00  TOTAL nfs_commit_execute_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                   21.85       +97.9%        43.23  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   51.38       +42.6%        73.26  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   28.81      +145.3%        70.68  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   13.74       +57.1%        21.59  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   29.11        -0.3%        29.02  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                   16.68       +90.5%        31.78  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   48.88       +41.2%        69.01  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   57.85       +32.7%        76.74  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   47.13       +63.1%        76.87  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    9.82       -33.0%         6.58  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   13.72       -18.1%        11.24  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   15.68       -65.0%         5.48  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  354.65       +45.4%       515.48  TOTAL write_bw

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  834.00      +224.2%      2704.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  311.00      +144.1%       759.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  282.00      +253.5%       997.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 1387.00      +334.2%      6023.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                 1081.00      +280.3%      4111.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  930.00      +368.0%      4352.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  254.00      +108.7%       530.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   38.00       +55.3%        59.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   54.00       +96.3%       106.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                 1321.00       -74.9%       332.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                 1932.00       -99.1%        17.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                 2074.00       -93.7%       130.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                10498.00       +91.7%     20120.00  TOTAL nfs_nr_commits

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                28359.00       -39.2%     17230.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                22241.00      +550.6%    144695.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                24969.00       +27.8%     31900.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                21722.00       +38.2%     30030.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                11015.00       +28.2%     14117.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                17012.00      +217.7%     54039.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                25616.00        +3.1%     26403.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                24761.00      +177.5%     68702.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                29235.00       +37.1%     40089.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                12929.00       +21.6%     15720.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                 7683.00       +24.2%      9542.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                 7471.00       +77.8%     13284.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
               233013.00       +99.9%    465751.00  TOTAL nfs_nr_writes

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                    7.84       -38.6%         4.81  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   49.58       -41.6%        28.94  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   30.58       -30.5%        21.27  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    2.99       -63.9%         1.08  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    8.06       -73.8%         2.12  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    5.33       -58.9%         2.19  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   57.68       -32.1%        39.15  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                  465.15       -16.5%       388.43  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  261.60       -16.4%       218.80  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    2.25      +163.5%         5.93  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    2.13     +9221.2%       198.29  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    2.27      +455.3%        12.61  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  895.47        +3.1%       923.62  TOTAL nfs_commit_size

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                    0.23      +227.7%         0.76  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    0.69       -78.1%         0.15  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    0.35       +92.4%         0.66  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    0.19       +13.4%         0.22  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    0.79       -22.2%         0.62  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    0.29       -39.5%         0.18  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                    0.57       +37.4%         0.79  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                    0.71       -53.3%         0.33  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                    0.48       +19.7%         0.58  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    0.23       -45.5%         0.13  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.53       -34.0%         0.35  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.63       -80.4%         0.12  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                    5.71       -14.5%         4.88  TOTAL nfs_write_size

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                 6544.25       -95.1%       321.04  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                 1064.82       +11.2%      1184.16  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                22801.48       -86.3%      3113.39  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 1083.47       -99.8%         2.56  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    3.82       -55.8%         1.69  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                 2840.08       -99.3%        20.09  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                20227.73       -96.6%       683.65  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 2346.04      +274.0%      8774.87  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                50812.68       -94.3%      2901.88  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  417.03       -99.9%         0.25  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    1.70       -97.9%         0.04  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                  126.23       -99.9%         0.08  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
               108269.33       -84.3%     17003.69  TOTAL nfs_write_queue_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  276.99       -41.1%       163.20  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  106.71       -67.0%        35.21  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   76.32       +13.4%        86.53  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  335.96       -41.8%       195.49  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   33.67       +70.7%        57.48  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  159.03       -46.0%        85.80  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  340.25       -67.6%       110.23  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   47.88       +12.7%        53.96  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  118.13       -53.8%        54.62  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  223.24       -53.0%       104.83  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   58.89       +12.8%        66.43  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   58.97      +223.0%       190.49  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                 1836.03       -34.4%      1204.27  TOTAL nfs_write_rtt_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                 6821.43       -92.9%       484.70  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                 1173.98        +4.0%      1220.80  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                22878.44       -86.0%      3201.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 1419.50       -86.0%       198.20  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   37.72       +57.1%        59.27  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                 2999.22       -96.5%       106.41  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                20570.86       -96.1%       795.46  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 2416.09      +265.6%      8832.81  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                50941.27       -94.2%      2960.10  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  640.32       -83.6%       105.13  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   60.78        +9.4%        66.49  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                  185.35        +2.8%       190.59  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
               110144.96       -83.5%     18220.96  TOTAL nfs_write_execute_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                   54.75       -89.4%         5.82  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   88.26       -98.7%         1.12  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   38.41       -92.1%         3.05  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    7.59       -91.0%         0.68  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    0.42       -93.3%         0.03  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    2.57       -75.1%         0.64  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  784.08       -93.8%        48.69  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 1338.39       -81.4%       248.51  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  586.69       -96.0%        23.32  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    1.27       -84.3%         0.20  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.02      +147.1%         0.06  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.16       -41.3%         0.09  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                 2902.62       -88.6%       332.20  TOTAL nfs_commit_queue_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  702.80        +8.2%       760.66  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  538.99       -35.8%       346.08  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  704.42       -37.1%       443.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  228.96       -18.4%       186.78  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  155.88       -54.6%        70.75  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  169.51       -28.9%       120.53  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                 3791.44       -11.4%      3361.05  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 4229.79       -17.8%      3476.80  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 5534.04       -35.4%      3574.73  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                   96.34       -31.4%        66.11  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   60.95       -35.5%        39.29  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   69.64       -35.3%        45.08  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                16282.75       -23.3%     12490.87  TOTAL nfs_commit_rtt_time

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  757.92        +1.2%       766.73  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  627.36       -44.6%       347.25  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  743.59       -40.0%       446.39  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  236.73       -20.8%       187.57  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  156.31       -54.7%        70.79  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  172.16       -29.6%       121.26  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                 4579.56       -25.5%      3411.34  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 5568.53       -33.1%      3725.49  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 6163.54       -41.5%      3605.27  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                   97.67       -32.1%        66.35  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   60.99       -35.5%        39.35  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   69.82       -35.3%        45.20  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                19234.16       -33.3%     12833.00  TOTAL nfs_commit_execute_time

---
Subject: nfs: fix a deadlock in nfs writeback path
Date: Tue Oct 18 16:49:19 CST 2011

From: "Tang, Feng" <feng.tang@intel.com>

In a corner case where nfs_congestion_kb is set very small, a
deadlock can happen in nfs_writepages():

	err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
	/* the callback may wait on congestion before the real pageio
	 * request has been issued by nfs_pageio_complete(), and thus
	 * block forever */
	nfs_pageio_complete(&pgio);

Moving the nfs_wait_congested() call after nfs_pageio_complete(&pgio)
fixes the issue. It is also more efficient to call nfs_wait_congested()
once per inode instead of once per dirty page of that inode.

Suggested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |   11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/nfs/write.c	2011-10-17 16:07:24.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-10-18 16:49:17.000000000 +0800
@@ -391,23 +391,18 @@ int nfs_writepage(struct page *page, str
 static int nfs_writepages_callback(struct page *page,
 				   struct writeback_control *wbc, void *data)
 {
-	struct inode *inode = page->mapping->host;
-	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
 
-	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
-			   &nfss->backing_dev_info,
-			   nfss->writeback_wait);
-
 	return ret;
 }
 
 int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	unsigned long *bitlock = &NFS_I(inode)->flags;
 	struct nfs_pageio_descriptor pgio;
 	int err;
@@ -424,6 +419,10 @@ int nfs_writepages(struct address_space 
 	err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
 	nfs_pageio_complete(&pgio);
 
+	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
 	smp_mb__after_clear_bit();
 	wake_up_bit(bitlock, NFS_INO_FLUSHING);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold
  2011-10-10 13:11       ` [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold Wu Fengguang
@ 2011-10-18  8:53         ` Wu Fengguang
  2011-10-18  8:59           ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-18  8:53 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

An update from Feng:

Subject: nfs: fix a bug about adjusting nfs_congestion_kb
Date: Tue Oct 18 12:47:58 CST 2011

From: "Tang, Feng" <feng.tang@intel.com>

The VM dirty_thresh may be set very small (even 0) by the user. In
that case nfs_congestion_kb may be adjusted down to 0, which makes the
normal NFS write path get congested and deadlocked. So set the bottom
line of nfs_congestion_kb to 128KB.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-next.orig/fs/nfs/write.c	2011-10-17 16:07:40.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-10-18 12:47:46.000000000 +0800
@@ -1814,6 +1814,7 @@ void nfs_update_congestion_thresh(void)
 	 */
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	dirty_thresh <<= PAGE_SHIFT - 10;
+	dirty_thresh += 1024;
 
 	if (nfs_congestion_kb > dirty_thresh / 8)
 		nfs_congestion_kb = dirty_thresh / 8;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold
  2011-10-18  8:53         ` Wu Fengguang
@ 2011-10-18  8:59           ` Wu Fengguang
  2011-10-20  2:49             ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-10-18  8:59 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML, Tang Feng

> @@ -1814,6 +1814,7 @@ void nfs_update_congestion_thresh(void)
>  	 */
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
>  	dirty_thresh <<= PAGE_SHIFT - 10;
> +	dirty_thresh += 1024;
>  
>  	if (nfs_congestion_kb > dirty_thresh / 8)
>  		nfs_congestion_kb = dirty_thresh / 8;

Here are the new results with ioless + this fix + fix to patch 1.

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  354.65       +45.4%       515.48  TOTAL write_bw                                                                                                      
                10498.00       +91.7%     20120.00  TOTAL nfs_nr_commits                                                                                                
               233013.00       +99.9%    465751.00  TOTAL nfs_nr_writes                                                                                                 
                  895.47        +3.1%       923.62  TOTAL nfs_commit_size                                                                                               
                    5.71       -14.5%         4.88  TOTAL nfs_write_size                                                                                                
               108269.33       -84.3%     17003.69  TOTAL nfs_write_queue_time                                                                                          
                 1836.03       -34.4%      1204.27  TOTAL nfs_write_rtt_time                                                                                            
               110144.96       -83.5%     18220.96  TOTAL nfs_write_execute_time                                                                                        
                 2902.62       -88.6%       332.20  TOTAL nfs_commit_queue_time                                                                                         
                16282.75       -23.3%     12490.87  TOTAL nfs_commit_rtt_time                                                                                           
                19234.16       -33.3%     12833.00  TOTAL nfs_commit_execute_time                                                                                       

before/after this single fix:
(performance is mostly unchanged except for the thresh=1M cases)

      3.1.0-rc8-nfs-wq3+        3.1.0-rc8-nfs-wq4+  
------------------------  ------------------------  
                   41.67        +3.8%        43.23  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   77.49        -5.5%        73.26  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   69.31        +2.0%        70.68  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   21.99        -1.8%        21.59  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   29.03        -0.0%        29.02  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                   32.51        -2.2%        31.78  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   69.78        -1.1%        69.01  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   73.39        +4.6%        76.74  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   80.84        -4.9%        76.87  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    5.78       +13.7%         6.58  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   11.10        +1.3%        11.24  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    7.60       -27.9%         5.48  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  520.49        -1.0%       515.48  TOTAL write_bw

More detailed comparison to vanilla kernel:

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                   21.85       +97.9%        43.23  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   51.38       +42.6%        73.26  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   28.81      +145.3%        70.68  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   13.74       +57.1%        21.59  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   29.11        -0.3%        29.02  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                   16.68       +90.5%        31.78  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   48.88       +41.2%        69.01  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   57.85       +32.7%        76.74  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   47.13       +63.1%        76.87  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    9.82       -33.0%         6.58  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   13.72       -18.1%        11.24  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   15.68       -65.0%         5.48  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  354.65       +45.4%       515.48  TOTAL write_bw

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                  834.00      +224.2%      2704.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  311.00      +144.1%       759.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  282.00      +253.5%       997.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 1387.00      +334.2%      6023.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                 1081.00      +280.3%      4111.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  930.00      +368.0%      4352.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  254.00      +108.7%       530.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   38.00       +55.3%        59.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   54.00       +96.3%       106.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                 1321.00       -74.9%       332.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                 1932.00       -99.1%        17.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                 2074.00       -93.7%       130.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                10498.00       +91.7%     20120.00  TOTAL nfs_nr_commits

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                28359.00       -39.2%     17230.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                22241.00      +550.6%    144695.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                24969.00       +27.8%     31900.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                21722.00       +38.2%     30030.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                11015.00       +28.2%     14117.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                17012.00      +217.7%     54039.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                25616.00        +3.1%     26403.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                24761.00      +177.5%     68702.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                29235.00       +37.1%     40089.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                12929.00       +21.6%     15720.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                 7683.00       +24.2%      9542.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                 7471.00       +77.8%     13284.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
               233013.00       +99.9%    465751.00  TOTAL nfs_nr_writes

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                    7.84       -38.6%         4.81  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   49.58       -41.6%        28.94  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   30.58       -30.5%        21.27  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    2.99       -63.9%         1.08  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    8.06       -73.8%         2.12  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    5.33       -58.9%         2.19  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   57.68       -32.1%        39.15  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                  465.15       -16.5%       388.43  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  261.60       -16.4%       218.80  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    2.25      +163.5%         5.93  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    2.13     +9221.2%       198.29  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    2.27      +455.3%        12.61  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  895.47        +3.1%       923.62  TOTAL nfs_commit_size

      3.1.0-rc8-vanilla+        3.1.0-rc8-nfs-wq4+
------------------------  ------------------------
                    0.23      +227.7%         0.76  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    0.69       -78.1%         0.15  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    0.35       +92.4%         0.66  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    0.19       +13.4%         0.22  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    0.79       -22.2%         0.62  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    0.29       -39.5%         0.18  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                    0.57       +37.4%         0.79  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                    0.71       -53.3%         0.33  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                    0.48       +19.7%         0.58  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    0.23       -45.5%         0.13  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.53       -34.0%         0.35  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.63       -80.4%         0.12  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                    5.71       -14.5%         4.88  TOTAL nfs_write_size


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold
  2011-10-18  8:59           ` Wu Fengguang
@ 2011-10-20  2:49             ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-20  2:49 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML, Tang Feng

> >  	if (nfs_congestion_kb > dirty_thresh / 8)
> >  		nfs_congestion_kb = dirty_thresh / 8;

To confirm whether that's a good threshold, I tried doubling it by
using "/ 4". That results in an 8.4% overall throughput regression,
so I'll stick with the current "/ 8".

Thanks,
Fengguang
---

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  472.17        -8.4%       432.32  TOTAL write_bw
                15776.00       -27.3%     11466.00  TOTAL nfs_nr_commits
               401731.00       +75.5%    705206.00  TOTAL nfs_nr_writes
                  918.32       +22.4%      1124.07  TOTAL nfs_commit_size
                    4.97       -25.3%         3.71  TOTAL nfs_write_size
                14098.08       -30.6%      9788.99  TOTAL nfs_write_queue_time
                 1314.33       +55.3%      2041.51  TOTAL nfs_write_rtt_time
                15438.68       -23.2%     11851.79  TOTAL nfs_write_execute_time
                  177.75       +73.1%       307.68  TOTAL nfs_commit_queue_time
                15026.97        -3.6%     14491.26  TOTAL nfs_commit_rtt_time
                15227.06        -2.7%     14809.49  TOTAL nfs_commit_execute_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                   44.49        -7.7%        41.08  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   62.35        -1.7%        61.28  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   67.51       -10.0%        60.76  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   22.90       -30.0%        16.03  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   29.91       -16.9%        24.87  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                   26.51       -20.0%        21.20  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   68.96        -1.2%        68.13  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   60.08       -17.9%        49.30  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   70.55        +4.1%        73.44  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    5.63       -16.9%         4.68  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    7.91        -2.4%         7.71  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    5.38       -28.7%         3.84  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  472.17        -8.4%       432.32  TOTAL write_bw

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                 2565.00       -11.6%      2267.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  384.00        +1.8%       391.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  866.00        -9.9%       780.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 5801.00       -41.0%      3423.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                 2092.00       -27.6%      1515.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                 3231.00       -24.2%      2450.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  408.00        -2.9%       396.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   46.00        -4.3%        44.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   86.00        +9.3%        94.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  138.00       -51.4%        67.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                  119.00       -91.6%        10.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   40.00       -27.5%        29.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                15776.00       -27.3%     11466.00  TOTAL nfs_nr_commits

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                17896.00       +30.5%     23348.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                62623.00      +371.7%    295418.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                37830.00       +37.4%     51991.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                27685.00       +13.1%     31312.00  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                51531.00        +7.4%     55330.00  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                84310.00       +14.8%     96814.00  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                21208.00        +0.8%     21375.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                26724.00        +5.1%     28092.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                31308.00       +67.6%     52461.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                12866.00       +13.9%     14649.00  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                12907.00        +9.2%     14088.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                14843.00       +37.0%     20328.00  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
               401731.00       +75.5%    705206.00  TOTAL nfs_nr_writes

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                    5.21        +5.2%         5.48  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   48.66        -3.4%        47.01  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   23.39        -0.1%        23.36  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    1.19       +18.7%         1.41  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    4.28       +15.1%         4.92  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    2.46        +5.5%         2.60  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   50.71        +1.7%        51.57  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                  464.69        -0.5%       462.22  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  245.31        -4.8%       233.60  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                   12.22       +71.4%        20.94  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   19.89     +1063.9%       231.46  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   40.33        -2.0%        39.52  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  918.32       +22.4%      1124.07  TOTAL nfs_commit_size

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                    0.75       -28.7%         0.53  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    0.30       -79.1%         0.06  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    0.54       -34.6%         0.35  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    0.25       -38.1%         0.15  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    0.17       -22.4%         0.13  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    0.09       -30.3%         0.07  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                    0.98        -2.1%         0.96  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                    0.80        -9.5%         0.72  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                    0.67       -37.9%         0.42  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    0.13       -26.9%         0.10  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.18       -10.4%         0.16  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.11       -48.1%         0.06  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                    4.97       -25.3%         3.71  TOTAL nfs_write_size

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  383.37       -32.7%       258.17  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                 5483.72       -67.9%      1759.47  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                 3757.03       -36.1%      2399.96  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    2.61     +9692.1%       255.20  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  145.09      +277.8%       548.13  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                   57.93      +584.8%       396.71  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  498.56        -1.1%       492.84  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 2475.43       +15.0%      2847.51  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 1293.33       -35.9%       829.56  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    0.20      +188.5%         0.59  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.56       -11.5%         0.49  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.25       +39.0%         0.35  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                14098.08       -30.6%      9788.99  TOTAL nfs_write_queue_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  156.19      +153.8%       396.37  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   47.14       -60.0%        18.86  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   74.76       +59.7%       119.40  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  176.05       +77.5%       312.45  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                   88.28       +32.3%       116.82  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  107.48       +12.3%       120.71  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  129.32       -14.8%       110.23  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   85.62       +36.9%       117.24  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   58.52       -20.8%        46.34  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  118.53      +115.7%       255.64  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                  112.27       +40.8%       158.12  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                  160.16       +68.2%       269.33  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                 1314.33       +55.3%      2041.51  TOTAL nfs_write_rtt_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  540.09       +21.2%       654.86  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                 5533.64       -67.8%      1779.65  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                 3833.18       -34.2%      2521.16  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  178.79      +217.6%       567.74  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  233.78      +184.6%       665.23  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  165.69      +212.4%       517.63  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                  630.13        -4.0%       605.10  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 2574.27       +15.6%      2976.37  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 1357.04       -35.2%       879.46  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  118.76      +115.8%       256.26  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                  112.86       +40.6%       158.64  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                  160.43       +68.1%       269.70  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                15438.68       -23.2%     11851.79  TOTAL nfs_write_execute_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                    4.15       -29.5%         2.92  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    5.79      +550.6%        37.64  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    5.30      +161.4%        13.86  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    0.63       +23.4%         0.78  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                    0.77       +52.3%         1.17  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                    0.54      +100.0%         1.08  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                   66.40       +44.8%        96.17  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   25.80       +41.4%        36.48  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   67.99       +72.2%       117.05  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    0.19       +66.4%         0.31  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                    0.10                      0.00  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                    0.10      +106.9%         0.21  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                  177.75       +73.1%       307.68  TOTAL nfs_commit_queue_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  786.25        -7.3%       729.04  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  523.35        +4.8%       548.62  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  487.58        +4.9%       511.71  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  181.60       +22.2%       221.92  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  110.03       +13.2%       124.60  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  128.24       +11.0%       142.35  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                 4352.76        -5.2%      4126.10  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 4022.83        -2.6%      3917.34  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 4282.52        -6.6%      3999.64  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                   62.09       +56.1%        96.94  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   39.20       -22.5%        30.40  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   50.52       -15.7%        42.59  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                15026.97        -3.6%     14491.26  TOTAL nfs_commit_rtt_time

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-nfs-thresh-4-next-20111014+
------------------------  ------------------------
                  790.68        -7.4%       732.23  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  529.17       +10.8%       586.28  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  493.32        +6.6%       525.82  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  182.35       +22.2%       222.82  NFS-thresh=10M/nfs-10dd-4k-32p-32768M-10M:10-X
                  110.82       +13.5%       125.80  NFS-thresh=10M/nfs-1dd-4k-32p-32768M-10M:10-X
                  128.80       +11.4%       143.46  NFS-thresh=10M/nfs-2dd-4k-32p-32768M-10M:10-X
                 4420.89        -4.4%      4224.24  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 4048.67        -2.3%      3953.91  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 4370.05        -5.6%      4124.34  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                   62.33       +56.1%        97.28  NFS-thresh=1M/nfs-10dd-4k-32p-32768M-1M:10-X
                   39.33       -22.4%        30.50  NFS-thresh=1M/nfs-1dd-4k-32p-32768M-1M:10-X
                   50.65       -15.5%        42.79  NFS-thresh=1M/nfs-2dd-4k-32p-32768M-1M:10-X
                15227.06        -2.7%     14809.49  TOTAL nfs_commit_execute_time

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 00/11] IO-less dirty throttling v12
  2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
                   ` (13 preceding siblings ...)
  2011-10-10 12:14 ` Peter Zijlstra
@ 2011-10-20  3:39 ` Wu Fengguang
  14 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-20  3:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Chris Mason, Theodore Ts'o

FYI, a simple sequential write comparison between the common filesystems.

For a newly created filesystem, btrfs is super fast!  Interestingly, btrfs
performs best in the dirty_thresh=100M cases, rather than in the 1G/8G cases.

btrfs also performs equally well across the 1dd, 2dd, 10dd and 100dd cases.
However, these tests are blind to the possibility of long-term fragmentation.

                   btrfs                      ext3                      ext4                       xfs  
------------------------  ------------------------  ------------------------  ------------------------  
                   56.55       -45.8%        30.66       -10.8%        50.42       -26.1%        41.76  thresh=1G/X-100dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6a+
                   56.11       -37.2%        35.24        +0.2%        56.23       -13.9%        48.34  thresh=1G/X-10dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6a+
                   56.21       -22.5%        43.58        +3.4%        58.12        -6.9%        52.36  thresh=1G/X-1dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6a+

                   58.23       -35.9%        37.34       -20.2%        46.45       -23.5%        44.53  thresh=100M/X-10dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6a+
                   58.43       -23.9%        44.44        -3.1%        56.60        -4.4%        55.89  thresh=100M/X-1dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6a+
                   58.53       -28.7%        41.70        -7.5%        54.14       -12.7%        51.11  thresh=100M/X-2dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6a+

                   54.37       -40.8%        32.21       -34.6%        35.58       -42.9%        31.07  thresh=8M/X-10dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6a+
                   56.12       -19.1%        45.37        +0.5%        56.39        -1.2%        55.44  thresh=8M/X-1dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6a+
                   56.22       -22.3%        43.71        -8.8%        51.26       -15.4%        47.59  thresh=8M/X-2dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6a+

                  510.77       -30.6%       354.27        -8.9%       465.19       -16.2%       428.07  TOTAL write_bw

Below is a more extensive run on virtually the same kernel.

In the thresh=8G case, ext4 performs noticeably better than the others,
and with big enough memory the number of dd tasks is no longer relevant.

                   btrfs                      ext3                      ext4                       xfs  
------------------------  ------------------------  ------------------------  ------------------------  
                   92.89       -28.6%        66.36        +7.8%       100.10        -2.7%        90.41  thresh=8G/X-100dd-1M-32p-32768M-8192M:10-3.1.0-rc8-ioless6+
                   89.69       -19.7%        72.00       +18.7%       106.42        +2.2%        91.67  thresh=8G/X-10dd-1M-32p-32768M-8192M:10-3.1.0-rc8-ioless6+
                   92.26       -18.7%        75.01       +16.3%       107.26        +1.4%        93.51  thresh=8G/X-1dd-1M-32p-32768M-8192M:10-3.1.0-rc8-ioless6+
                   89.96       -16.8%        74.87       +20.7%       108.62        +3.1%        92.76  thresh=8G/X-2dd-1M-32p-32768M-8192M:10-3.1.0-rc8-ioless6+
note: the above 8G cases were run on another test box!

                   60.29       -47.0%        31.96       -14.6%        51.47       -27.9%        43.44  thresh=1G/X-100dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6+
                   58.80       -38.5%        36.19        -4.4%        56.19       -15.3%        49.83  thresh=1G/X-10dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6+
                   58.53       -23.1%        45.03        -0.2%        58.41       -10.0%        52.70  thresh=1G/X-1dd-4k-8p-4096M-1024M:10-3.1.0-rc8-ioless6+

                   58.01       -35.0%        37.69        -4.1%        55.62       -12.4%        50.82  thresh=400M-300M/X-10dd-4k-8p-4096M-400M:300M-3.1.0-rc8-ioless6+
                   57.69       -26.0%        42.69        +1.8%        58.71        -2.4%        56.33  thresh=400M-300M/X-1dd-4k-8p-4096M-400M:300M-3.1.0-rc8-ioless6+
                   57.13       -32.4%        38.63        +2.5%        58.58        -6.8%        53.27  thresh=400M-300M/X-2dd-4k-8p-4096M-400M:300M-3.1.0-rc8-ioless6+

                   56.97       -33.3%        38.01        -3.2%        55.14        -9.3%        51.67  thresh=400M/X-10dd-4k-8p-4096M-400M:10-3.1.0-rc8-ioless6+
                   57.78       -22.3%        44.90        +0.6%        58.14        -0.7%        57.35  thresh=400M/X-1dd-4k-8p-4096M-400M:10-3.1.0-rc8-ioless6+
                   56.12       -27.3%        40.81        +2.4%        57.49        -4.9%        53.36  thresh=400M/X-2dd-4k-8p-4096M-400M:10-3.1.0-rc8-ioless6+

                   59.39       -36.0%        38.02       -20.0%        47.50       -24.4%        44.89  thresh=100M/X-10dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6+
                   58.68       -23.0%        45.20        -0.9%        58.18        -1.1%        58.06  thresh=100M/X-1dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6+
                   58.92       -27.9%        42.50        -5.3%        55.79       -11.8%        51.94  thresh=100M/X-2dd-4k-8p-4096M-100M:10-3.1.0-rc8-ioless6+

                   57.12       -41.1%        33.63       -36.0%        36.58       -43.6%        32.19  thresh=8M/X-10dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6+
                   59.29       -18.5%        48.30        -3.3%        57.35        -5.8%        55.86  thresh=8M/X-1dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6+
                   59.23       -21.0%        46.77       -10.8%        52.82       -17.3%        48.96  thresh=8M/X-2dd-4k-8p-4096M-8M:10-3.1.0-rc8-ioless6+

                 1238.75       -27.5%       898.56        +0.1%      1240.38        -8.9%      1129.01  TOTAL write_bw

Thanks,
Fengguang


* Re: [RFC][PATCH 1/2] nfs: writeback pages wait queue
  2011-10-18  8:51       ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
@ 2011-10-20  3:59         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-10-20  3:59 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

> @@ -424,6 +419,10 @@ int nfs_writepages(struct address_space 
>  	err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
>  	nfs_pageio_complete(&pgio);
>  
> +	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
> +			   &nfss->backing_dev_info,
> +			   nfss->writeback_wait);
> +
>  	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
>  	smp_mb__after_clear_bit();
>  	wake_up_bit(bitlock, NFS_INO_FLUSHING);

The "wake up NFS_INO_FLUSHING after the congestion wait" ordering looks
strange, so I tried moving nfs_wait_congested() _after_
wake_up_bit()... and got write_bw regressions.

OK, not knowing what's going on underneath, I'll just stick with the current form.
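The patch's wait queue can be pictured with a small user-space analogue (purely illustrative; the WritebackWaitQueue class, its limit handling, and the method names are inventions, not the kernel implementation): writers block once the count of pages under writeback reaches a limit, and are woken as completions drain the count back down.

```python
import threading
import time

class WritebackWaitQueue:
    """User-space sketch (hypothetical) of a writeback pages wait queue."""

    def __init__(self, limit):
        self.limit = limit
        self.nr_writeback = 0          # pages currently under writeback
        self.cond = threading.Condition()

    def begin_write(self):
        with self.cond:
            # Throttle: wait while at or over the congestion limit.
            while self.nr_writeback >= self.limit:
                self.cond.wait()
            self.nr_writeback += 1

    def end_write(self):
        with self.cond:
            # A completion frees a slot and wakes blocked writers.
            self.nr_writeback -= 1
            self.cond.notify_all()

q = WritebackWaitQueue(limit=2)
q.begin_write()
q.begin_write()
# A "completion" arriving shortly afterwards frees one slot.
threading.Thread(target=lambda: (time.sleep(0.05), q.end_write())).start()
q.begin_write()   # throttled here until the completion above
```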

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+  
------------------------  ------------------------  
                  417.26        -5.5%       394.24  TOTAL write_bw
                 5179.00       -12.6%      4529.00  TOTAL nfs_nr_commits
               340466.00       -37.2%    213939.00  TOTAL nfs_nr_writes
                  722.54       +17.6%       849.42  TOTAL nfs_commit_size
                    3.75        +6.3%         3.99  TOTAL nfs_write_size
                15477.38       -14.5%     13235.34  TOTAL nfs_write_queue_time
                  517.54       +13.0%       585.00  TOTAL nfs_write_rtt_time
                16011.09       -13.5%     13848.09  TOTAL nfs_write_execute_time
                  714.65       -43.4%       404.65  TOTAL nfs_commit_queue_time
                12787.93        +9.2%     13960.35  TOTAL nfs_commit_rtt_time
                13519.94        +6.4%     14387.44  TOTAL nfs_commit_execute_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                   44.42        -0.8%        44.05  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   78.49        -8.0%        72.22  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   69.96        -2.4%        68.30  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   70.59        -3.8%        67.88  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   76.76        -8.7%        70.09  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   77.04        -6.9%        71.70  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  417.26        -5.5%       394.24  TOTAL write_bw

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                 2683.00        -1.8%      2634.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  811.00       -43.8%       456.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                 1049.00       -16.7%       874.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  474.00        -8.4%       434.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   56.00       -23.2%        43.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  106.00       -17.0%        88.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                 5179.00       -12.6%      4529.00  TOTAL nfs_nr_commits

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                17641.00        -2.2%     17257.00  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
               177296.00       -54.1%     81335.00  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                26346.00       +41.6%     37309.00  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                22279.00       +12.7%     25107.00  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                67612.00       -59.7%     27271.00  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                29292.00       -12.4%     25660.00  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
               340466.00       -37.2%    213939.00  TOTAL nfs_nr_writes

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                    4.97        +1.1%         5.03  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   29.00       +63.5%        47.43  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   19.99       +17.0%        23.40  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   44.66        +5.3%        47.03  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                  408.94       +18.6%       485.03  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                  214.96       +12.3%       241.50  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  722.54       +17.6%       849.42  TOTAL nfs_commit_size

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                    0.76        +1.5%         0.77  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    0.13      +100.4%         0.27  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    0.80       -31.1%         0.55  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                    0.95       -14.4%         0.81  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                    0.34      +125.8%         0.76  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                    0.78        +6.5%         0.83  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                    3.75        +6.3%         3.99  TOTAL nfs_write_size

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                  460.14       -29.9%       322.63  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  718.69       +68.2%      1208.67  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                 5203.60       -26.1%      3843.39  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  427.40       +93.4%       826.64  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 7369.68       -18.0%      6041.98  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 1297.87       -23.6%       992.02  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                15477.38       -14.5%     13235.34  TOTAL nfs_write_queue_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                  134.58        -8.9%       122.60  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                   33.24       +48.8%        49.46  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                   89.69       -19.4%        72.31  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  129.86       -35.3%        84.03  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                   48.33      +239.4%       164.05  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   81.84       +13.1%        92.55  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  517.54       +13.0%       585.00  TOTAL nfs_write_rtt_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                  595.23       -25.1%       445.75  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  753.68       +67.4%      1261.46  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                 5294.55       -26.0%      3917.11  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                  559.44       +63.1%       912.46  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 7424.41       -16.2%      6221.64  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 1383.78       -21.3%      1089.67  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                16011.09       -13.5%     13848.09  TOTAL nfs_write_execute_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                    2.34        +2.5%         2.40  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                    1.59      +488.4%         9.37  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                    2.63       +47.0%         3.86  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                   68.22        -6.5%        63.78  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                  618.34       -52.9%       291.44  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                   21.54       +56.9%        33.80  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                  714.65       -43.4%       404.65  TOTAL nfs_commit_queue_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                  766.76        +1.7%       779.63  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  344.34       +49.3%       514.10  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  431.90       +15.0%       496.81  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 3743.78        +5.6%      3954.60  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 3699.59       +14.0%      4216.05  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 3801.55        +5.2%      3999.17  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                12787.93        +9.2%     13960.35  TOTAL nfs_commit_rtt_time

3.1.0-rc9-ioless-full-nfs-ino-flushing-next-20111014+  3.1.0-rc9-ioless-full-nfs-wakeup-wait-next-20111014+
------------------------  ------------------------
                  769.38        +1.7%       782.27  NFS-thresh=100M/nfs-10dd-4k-32p-32768M-100M:10-X
                  346.06       +51.3%       523.52  NFS-thresh=100M/nfs-1dd-4k-32p-32768M-100M:10-X
                  434.96       +15.2%       501.07  NFS-thresh=100M/nfs-2dd-4k-32p-32768M-100M:10-X
                 3813.17        +5.4%      4020.45  NFS-thresh=1G/nfs-10dd-4k-32p-32768M-1024M:10-X
                 4318.75        +4.4%      4507.51  NFS-thresh=1G/nfs-1dd-4k-32p-32768M-1024M:10-X
                 3837.62        +5.6%      4052.61  NFS-thresh=1G/nfs-2dd-4k-32p-32768M-1024M:10-X
                13519.94        +6.4%     14387.44  TOTAL nfs_commit_execute_time
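
For reference, the +/-% columns in the tables above appear to be plain relative
changes between the two kernels' per-case numbers. A minimal sketch of that
computation (the `percent_change` helper is hypothetical, not part of the patch
set or test harness):

```python
def percent_change(old: float, new: float) -> float:
    """Relative change of `new` vs. `old`, in percent, rounded to one decimal
    as printed in the comparison tables."""
    return round((new - old) / old * 100, 1)

# Reproducing two TOTAL/1dd rows from the nfs_commit_queue_time table:
print(percent_change(714.65, 404.65))  # -43.4  (TOTAL row)
print(percent_change(618.34, 291.44))  # -52.9  (1dd, thresh=1G row)
```

Per-case deltas may differ in the last digit from this sketch when the
underlying counters carry more precision than the two decimals shown.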

Thanks,
Fengguang


end of thread (newest message: 2011-10-20  3:59 UTC)

Thread overview: 29+ messages
2011-10-03 13:42 [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
2011-10-03 13:42 ` [PATCH 01/11] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-10-03 13:42 ` [PATCH 02/11] writeback: dirty position control Wu Fengguang
2011-10-03 13:42 ` [PATCH 03/11] writeback: add bg_threshold parameter to __bdi_update_bandwidth() Wu Fengguang
2011-10-03 13:42 ` [PATCH 04/11] writeback: dirty rate control Wu Fengguang
2011-10-03 13:42 ` [PATCH 05/11] writeback: stabilize bdi->dirty_ratelimit Wu Fengguang
2011-10-03 13:42 ` [PATCH 06/11] writeback: per task dirty rate limit Wu Fengguang
2011-10-03 13:42 ` [PATCH 07/11] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-10-03 13:42 ` [PATCH 08/11] writeback: limit max dirty pause time Wu Fengguang
2011-10-03 13:42 ` [PATCH 09/11] writeback: control " Wu Fengguang
2011-10-03 13:42 ` [PATCH 10/11] writeback: dirty position control - bdi reserve area Wu Fengguang
2011-10-03 13:42 ` [PATCH 11/11] writeback: per-bdi background threshold Wu Fengguang
2011-10-03 13:59 ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
2011-10-05  1:42   ` Wu Fengguang
2011-10-04 19:52 ` Vivek Goyal
2011-10-05 13:56   ` Wu Fengguang
2011-10-05 15:16   ` Andi Kleen
2011-10-10 12:14 ` Peter Zijlstra
2011-10-10 13:07   ` Wu Fengguang
2011-10-10 13:10     ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
2011-10-10 13:11       ` [RFC][PATCH 2/2] nfs: scale writeback threshold proportional to dirty threshold Wu Fengguang
2011-10-18  8:53         ` Wu Fengguang
2011-10-18  8:59           ` Wu Fengguang
2011-10-20  2:49             ` Wu Fengguang
2011-10-18  8:51       ` [RFC][PATCH 1/2] nfs: writeback pages wait queue Wu Fengguang
2011-10-20  3:59         ` Wu Fengguang
2011-10-10 14:28     ` [PATCH 00/11] IO-less dirty throttling v12 Wu Fengguang
2011-10-17  3:03       ` Wu Fengguang
2011-10-20  3:39 ` Wu Fengguang
