All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] IO-less dirty throttling v9
@ 2011-08-16  2:20 ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi,

The core bits of the IO-less balance_dirty_pages().

        git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v9

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing update step size on small errors
- limit per-CPU dirtied pages to avoid dirty pages run away on 1k+ tasks (Peter)

Thanks a lot to Peter and Andrea, Vivek for the careful reviews!

shortlog:
        
        Wu Fengguang (5):
              writeback: account per-bdi accumulated dirtied pages
              writeback: dirty position control
              writeback: dirty rate control
              writeback: per task dirty rate limit
              writeback: IO-less balance_dirty_pages()

        The last 4 patches are one single logical change, but splitted here to
        make it easier to review the different parts of the algorithm.

diffstat:

	 fs/fs-writeback.c                |    2 
	 include/linux/backing-dev.h      |    8 
	 include/linux/sched.h            |    7 
	 include/linux/writeback.h        |    1 
	 include/trace/events/writeback.h |   24 -
	 kernel/fork.c                    |    3 
	 mm/backing-dev.c                 |    3 
	 mm/page-writeback.c              |  544 ++++++++++++++++++++---------
	 8 files changed, 414 insertions(+), 178 deletions(-)

Thanks,
Fengguang



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 0/5] IO-less dirty throttling v9
@ 2011-08-16  2:20 ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi,

The core bits of the IO-less balance_dirty_pages().

        git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v9

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing update step size on small errors
- limit per-CPU dirtied pages to avoid dirty pages run away on 1k+ tasks (Peter)

Thanks a lot to Peter and Andrea, Vivek for the careful reviews!

shortlog:
        
        Wu Fengguang (5):
              writeback: account per-bdi accumulated dirtied pages
              writeback: dirty position control
              writeback: dirty rate control
              writeback: per task dirty rate limit
              writeback: IO-less balance_dirty_pages()

        The last 4 patches are one single logical change, but splitted here to
        make it easier to review the different parts of the algorithm.

diffstat:

	 fs/fs-writeback.c                |    2 
	 include/linux/backing-dev.h      |    8 
	 include/linux/sched.h            |    7 
	 include/linux/writeback.h        |    1 
	 include/trace/events/writeback.h |   24 -
	 kernel/fork.c                    |    3 
	 mm/backing-dev.c                 |    3 
	 mm/page-writeback.c              |  544 ++++++++++++++++++++---------
	 8 files changed, 414 insertions(+), 178 deletions(-)

Thanks,
Fengguang


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages
  2011-08-16  2:20 ` Wu Fengguang
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Michael Rubin, Wu Fengguang,
	Andrew Morton, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2019 bytes --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:58:40.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:58:40.000000000 +0800
@@ -1530,6 +1530,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:55.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Michael Rubin, Wu Fengguang,
	Andrew Morton, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2322 bytes --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:58:40.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:58:40.000000000 +0800
@@ -1530,6 +1530,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:55.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20 ` Wu Fengguang
  (?)
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13157 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range [setpoint - write_bw/2, setpoint + write_bw/2],
we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13460 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range [setpoint - write_bw/2, setpoint + write_bw/2],
we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13460 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range [setpoint - write_bw/2, setpoint + write_bw/2],
we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
  2011-08-16  2:20 ` Wu Fengguang
  (?)
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 12818 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
    	pos_rate = ratelimit_in_past_200ms
		 = bdi->dirty_ratelimit * bdi_position_ratio();

	balanced_rate = ratelimit_in_past_200ms * write_bw / dirty_rate;

        update bdi->dirty_ratelimit closer to balanced_rate and pos_rate
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) to

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extend (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that subjects to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged in
the past 200ms (very small comparing to the 3s estimation period for
write_bw), which makes rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_rate. The truncates, due to its possibly bumpy
nature, can hardly be compensated smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematical errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point to bring up dirty_ratelimit in a hurry only to hurt both the
above two goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_rate. pos_rate also has the nice smaller errors in stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |  108 +++++++++++++++++++++++++++++++++-
 3 files changed, 114 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-16 10:07:23.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-16 10:07:23.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 10:13:33.000000000 +0800
@@ -773,6 +773,104 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long base_rate = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long executed_rate;
+	unsigned long balanced_rate;
+	unsigned long pos_rate;
+	unsigned long delta;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * executed_rate reflects each dd's dirty rate for the past 200ms.
+	 */
+	executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle bandwidth.
+	 */
+	balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
+				dirty_rate | 1);
+
+	/*
+	 * Use a different name for the same value to distinguish the concepts.
+	 * Only the relative value of
+	 *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
+	 * will be used below, which reflects the direction and size of dirty
+	 * position error.
+	 */
+	pos_rate = executed_rate;
+
+	/*
+	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
+	 * same side of dirty_ratelimit, too.
+	 * For example,
+	 * - (base_rate > balanced_rate) => dirty rate is too high
+	 * - (base_rate > pos_rate)      => dirty pages are above setpoint
+	 * so lowering base_rate will help meet both the position and rate
+	 * control targets. Otherwise, don't update base_rate if it will only
+	 * help meet the rate target. After all, what the users ultimately feel
+	 * and care are stable dirty rate and small position error.  This
+	 * update policy can also prevent dirty_ratelimit from being driven
+	 * away by possible systematic errors in balanced_rate.
+	 *
+	 * |base_rate - pos_rate| is also used to limit the step size for
+	 * filtering out the sigular points of balanced_rate, which keeps
+	 * jumping around randomly and can even leap far away at times due to
+	 * the small 200ms estimation period of dirty_rate (we want to keep
+	 * that period small to reduce time lags).
+	 */
+	delta = 0;
+	if (base_rate < balanced_rate) {
+		if (base_rate < pos_rate)
+			delta = min(balanced_rate, pos_rate) - base_rate;
+	} else {
+		if (base_rate > pos_rate)
+			delta = base_rate - max(balanced_rate, pos_rate);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	delta = (delta + 7) / 8;
+
+	if (base_rate < balanced_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
+	bdi->dirty_ratelimit = max(base_rate, 1);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -783,6 +881,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -791,6 +890,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -800,12 +900,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 13121 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
    	pos_rate = ratelimit_in_past_200ms
		 = bdi->dirty_ratelimit * bdi_position_ratio();

	balanced_rate = ratelimit_in_past_200ms * write_bw / dirty_rate;

        update bdi->dirty_ratelimit closer to balanced_rate and pos_rate
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) to

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extend (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that subjects to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged in
the past 200ms (very small comparing to the 3s estimation period for
write_bw), which makes rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_rate. The truncates, due to its possibly bumpy
nature, can hardly be compensated smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematical errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point to bring up dirty_ratelimit in a hurry only to hurt both the
above two goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_rate. pos_rate also has the nice smaller errors in stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |  108 +++++++++++++++++++++++++++++++++-
 3 files changed, 114 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-16 10:07:23.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-16 10:07:23.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 10:13:33.000000000 +0800
@@ -773,6 +773,104 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long base_rate = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long executed_rate;
+	unsigned long balanced_rate;
+	unsigned long pos_rate;
+	unsigned long delta;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * executed_rate reflects each dd's dirty rate for the past 200ms.
+	 */
+	executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle bandwidth.
+	 */
+	balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
+				dirty_rate | 1);
+
+	/*
+	 * Use a different name for the same value to distinguish the concepts.
+	 * Only the relative value of
+	 *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
+	 * will be used below, which reflects the direction and size of dirty
+	 * position error.
+	 */
+	pos_rate = executed_rate;
+
+	/*
+	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
+	 * same side of dirty_ratelimit, too.
+	 * For example,
+	 * - (base_rate > balanced_rate) => dirty rate is too high
+	 * - (base_rate > pos_rate)      => dirty pages are above setpoint
+	 * so lowering base_rate will help meet both the position and rate
+	 * control targets. Otherwise, don't update base_rate if it will only
+	 * help meet the rate target. After all, what the users ultimately feel
+	 * and care are stable dirty rate and small position error.  This
+	 * update policy can also prevent dirty_ratelimit from being driven
+	 * away by possible systematic errors in balanced_rate.
+	 *
+	 * |base_rate - pos_rate| is also used to limit the step size for
+	 * filtering out the sigular points of balanced_rate, which keeps
+	 * jumping around randomly and can even leap far away at times due to
+	 * the small 200ms estimation period of dirty_rate (we want to keep
+	 * that period small to reduce time lags).
+	 */
+	delta = 0;
+	if (base_rate < balanced_rate) {
+		if (base_rate < pos_rate)
+			delta = min(balanced_rate, pos_rate) - base_rate;
+	} else {
+		if (base_rate > pos_rate)
+			delta = base_rate - max(balanced_rate, pos_rate);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	delta = (delta + 7) / 8;
+
+	if (base_rate < balanced_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
+	bdi->dirty_ratelimit = max(base_rate, 1);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -783,6 +881,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -791,6 +890,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -800,12 +900,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 13121 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
    	pos_rate = ratelimit_in_past_200ms
		 = bdi->dirty_ratelimit * bdi_position_ratio();

	balanced_rate = ratelimit_in_past_200ms * write_bw / dirty_rate;

        update bdi->dirty_ratelimit closer to balanced_rate and pos_rate
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) to

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extend (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that subjects to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged in
the past 200ms (very small comparing to the 3s estimation period for
write_bw), which makes rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_rate. The truncates, due to its possibly bumpy
nature, can hardly be compensated smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematical errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point to bring up dirty_ratelimit in a hurry only to hurt both the
above two goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_rate. pos_rate also has the nice smaller errors in stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |  108 +++++++++++++++++++++++++++++++++-
 3 files changed, 114 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-16 10:07:23.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-16 10:07:23.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 10:13:33.000000000 +0800
@@ -773,6 +773,104 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long base_rate = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long executed_rate;
+	unsigned long balanced_rate;
+	unsigned long pos_rate;
+	unsigned long delta;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * executed_rate reflects each dd's dirty rate for the past 200ms.
+	 */
+	executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle bandwidth.
+	 */
+	balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
+				dirty_rate | 1);
+
+	/*
+	 * Use a different name for the same value to distinguish the concepts.
+	 * Only the relative value of
+	 *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
+	 * will be used below, which reflects the direction and size of dirty
+	 * position error.
+	 */
+	pos_rate = executed_rate;
+
+	/*
+	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
+	 * same side of dirty_ratelimit, too.
+	 * For example,
+	 * - (base_rate > balanced_rate) => dirty rate is too high
+	 * - (base_rate > pos_rate)      => dirty pages are above setpoint
+	 * so lowering base_rate will help meet both the position and rate
+	 * control targets. Otherwise, don't update base_rate if it will only
+	 * help meet the rate target. After all, what the users ultimately feel
+	 * and care are stable dirty rate and small position error.  This
+	 * update policy can also prevent dirty_ratelimit from being driven
+	 * away by possible systematic errors in balanced_rate.
+	 *
+	 * |base_rate - pos_rate| is also used to limit the step size for
+	 * filtering out the sigular points of balanced_rate, which keeps
+	 * jumping around randomly and can even leap far away at times due to
+	 * the small 200ms estimation period of dirty_rate (we want to keep
+	 * that period small to reduce time lags).
+	 */
+	delta = 0;
+	if (base_rate < balanced_rate) {
+		if (base_rate < pos_rate)
+			delta = min(balanced_rate, pos_rate) - base_rate;
+	} else {
+		if (base_rate > pos_rate)
+			delta = base_rate - max(balanced_rate, pos_rate);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	delta = (delta + 7) / 8;
+
+	if (base_rate < balanced_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
+	bdi->dirty_ratelimit = max(base_rate, 1);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -783,6 +881,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -791,6 +890,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -800,12 +900,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  2:20 ` Wu Fengguang
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 7450 bytes --]

Add two fields to task_struct.

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.

The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so that the dirty threshold will be exceeded
long before each task reached its nr_dirtied_pause and hence call
balance_dirty_pages().

The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++
 kernel/fork.c         |    3 +
 mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
 3 files changed, 61 insertions(+), 39 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
@@ -1525,6 +1525,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
@@ -54,20 +54,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -169,6 +155,8 @@ static void update_completion_period(voi
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
+
+	writeback_set_ratelimit();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
@@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If dirty_poll_interval is too low, big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long dirty_poll_interval(unsigned long dirty,
+					 unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
+static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
 					unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	unsigned long ratelimit;
-	unsigned long *p;
+	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	if (!bdi->dirty_exceeded)
+		ratelimit = current->nr_dirtied_pause;
+	else
+		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
+
+	current->nr_dirtied += nr_pages_dirtied;
 
+	preempt_disable();
 	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
+	 * This prevents one CPU to accumulate too many dirtied pages without
+	 * calling into balance_dirty_pages(), which can happen when there are
+	 * 1000+ tasks, all of them start dirtying pages at exactly the same
+	 * time, hence all honoured too large initial task->nr_dirtied_pause.
 	 */
-	preempt_disable();
 	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	else {
+		*p += nr_pages_dirtied;
+		if (unlikely(*p >= ratelimit_pages)) {
+			*p = 0;
+			ratelimit = 0;
+		}
 	}
 	preempt_enable();
+
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
  *
  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
+ * thresholds.
  */
 
 void writeback_set_ratelimit(void)
 {
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
 
 static int __cpuinit
--- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
@@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 7753 bytes --]

Add two fields to task_struct.

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.

The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so that the dirty threshold will be exceeded
long before each task reached its nr_dirtied_pause and hence call
balance_dirty_pages().

The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++
 kernel/fork.c         |    3 +
 mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
 3 files changed, 61 insertions(+), 39 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
@@ -1525,6 +1525,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
@@ -54,20 +54,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -169,6 +155,8 @@ static void update_completion_period(voi
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
+
+	writeback_set_ratelimit();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
@@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If dirty_poll_interval is too low, big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long dirty_poll_interval(unsigned long dirty,
+					 unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
+static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
 					unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	unsigned long ratelimit;
-	unsigned long *p;
+	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	if (!bdi->dirty_exceeded)
+		ratelimit = current->nr_dirtied_pause;
+	else
+		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
+
+	current->nr_dirtied += nr_pages_dirtied;
 
+	preempt_disable();
 	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
+	 * This prevents one CPU to accumulate too many dirtied pages without
+	 * calling into balance_dirty_pages(), which can happen when there are
+	 * 1000+ tasks, all of them start dirtying pages at exactly the same
+	 * time, hence all honoured too large initial task->nr_dirtied_pause.
 	 */
-	preempt_disable();
 	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	else {
+		*p += nr_pages_dirtied;
+		if (unlikely(*p >= ratelimit_pages)) {
+			*p = 0;
+			ratelimit = 0;
+		}
 	}
 	preempt_enable();
+
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
  *
  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
+ * thresholds.
  */
 
 void writeback_set_ratelimit(void)
 {
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
 
 static int __cpuinit
--- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
@@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20 ` Wu Fengguang
  (?)
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15084 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15387 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15387 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16  7:17     ` Andrea Righi
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrea Righi @ 2011-08-16  7:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:10AM +0800, Wu Fengguang wrote:
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> scale near-sqrt to the safety gap between dirty pages and threshold.
> 
> The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
> pages at exactly the same time, each task will be assigned a large
> initial nr_dirtied_pause, so that the dirty threshold will be exceeded
> long before each task reached its nr_dirtied_pause and hence call
> balance_dirty_pages().
> 
> The solution is to watch for the number of pages dirtied on each CPU in
> between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
> (3% dirty threshold), force call balance_dirty_pages() for a chance to
> set bdi->dirty_exceeded. In normal situations, this safeguarding
> condition is not expected to trigger at all.
> 
> peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/sched.h |    7 +++
>  kernel/fork.c         |    3 +
>  mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
>  3 files changed, 61 insertions(+), 39 deletions(-)
> 
> --- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
> @@ -1525,6 +1525,13 @@ struct task_struct {
>  	int make_it_fail;
>  #endif
>  	struct prop_local_single dirties;
> +	/*
> +	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
> +	 * balance_dirty_pages() for some dirty throttling pause
> +	 */
> +	int nr_dirtied;
> +	int nr_dirtied_pause;
> +
>  #ifdef CONFIG_LATENCYTOP
>  	int latency_record_count;
>  	struct latency_record latency_record[LT_SAVECOUNT];
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
> @@ -54,20 +54,6 @@
>   */
>  static long ratelimit_pages = 32;
>  
> -/*
> - * When balance_dirty_pages decides that the caller needs to perform some
> - * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than dirtied pages to ensure that reasonably
> - * large amounts of I/O are submitted.
> - */
> -static inline long sync_writeback_pages(unsigned long dirtied)
> -{
> -	if (dirtied < ratelimit_pages)
> -		dirtied = ratelimit_pages;
> -
> -	return dirtied + dirtied / 2;
> -}
> -
>  /* The following parameters are exported via /proc/sys/vm */
>  
>  /*
> @@ -169,6 +155,8 @@ static void update_completion_period(voi
>  	int shift = calc_period_shift();
>  	prop_change_shift(&vm_completions, shift);
>  	prop_change_shift(&vm_dirties, shift);
> +
> +	writeback_set_ratelimit();
>  }
>  
>  int dirty_background_ratio_handler(struct ctl_table *table, int write,
> @@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
>  }
>  
>  /*
> + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> + * will look to see if it needs to start dirty throttling.
> + *
> + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> + * global_page_state() too often. So scale it near-sqrt to the safety margin
> + * (the number of pages we may dirty without exceeding the dirty limits).
> + */
> +static unsigned long dirty_poll_interval(unsigned long dirty,
> +					 unsigned long thresh)
> +{
> +	if (thresh > dirty)
> +		return 1UL << (ilog2(thresh - dirty) >> 1);
> +
> +	return 1;
> +}
> +
> +/*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> @@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
>  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> +	current->nr_dirtied = 0;
> +	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
> +
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
>  	}
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> +static DEFINE_PER_CPU(int, bdp_ratelimits);
>  
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> @@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
>  					unsigned long nr_pages_dirtied)
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> -	unsigned long ratelimit;
> -	unsigned long *p;
> +	int ratelimit;
> +	int *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>  
> -	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> -		ratelimit = 8;
> +	if (!bdi->dirty_exceeded)
> +		ratelimit = current->nr_dirtied_pause;
> +	else
> +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Usage of ratelimit before init?

Maybe:

	ratelimit = current->nr_dirtied_pause;
	if (bdi->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Thanks,
-Andrea

> +
> +	current->nr_dirtied += nr_pages_dirtied;
>  
> +	preempt_disable();
>  	/*
> -	 * Check the rate limiting. Also, we do not want to throttle real-time
> -	 * tasks in balance_dirty_pages(). Period.
> +	 * This prevents one CPU to accumulate too many dirtied pages without
> +	 * calling into balance_dirty_pages(), which can happen when there are
> +	 * 1000+ tasks, all of them start dirtying pages at exactly the same
> +	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>  	 */
> -	preempt_disable();
>  	p =  &__get_cpu_var(bdp_ratelimits);
> -	*p += nr_pages_dirtied;
> -	if (unlikely(*p >= ratelimit)) {
> -		ratelimit = sync_writeback_pages(*p);
> +	if (unlikely(current->nr_dirtied >= ratelimit))
>  		*p = 0;
> -		preempt_enable();
> -		balance_dirty_pages(mapping, ratelimit);
> -		return;
> +	else {
> +		*p += nr_pages_dirtied;
> +		if (unlikely(*p >= ratelimit_pages)) {
> +			*p = 0;
> +			ratelimit = 0;
> +		}
>  	}
>  	preempt_enable();
> +
> +	if (unlikely(current->nr_dirtied >= ratelimit))
> +		balance_dirty_pages(mapping, current->nr_dirtied);
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>  
> @@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
>   *
>   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> - * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> + * thresholds.
>   */
>  
>  void writeback_set_ratelimit(void)
>  {
> -	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
> +	unsigned long background_thresh;
> +	unsigned long dirty_thresh;
> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> +	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
>  	if (ratelimit_pages < 16)
>  		ratelimit_pages = 16;
> -	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
>  }
>  
>  static int __cpuinit
> --- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
> @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
>  	p->pdeath_signal = 0;
>  	p->exit_state = 0;
>  
> +	p->nr_dirtied = 0;
> +	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
> +
>  	/*
>  	 * Ok, make it visible to the rest of the system.
>  	 * We dont wake it up yet.
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-16  7:17     ` Andrea Righi
  0 siblings, 0 replies; 98+ messages in thread
From: Andrea Righi @ 2011-08-16  7:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:10AM +0800, Wu Fengguang wrote:
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> scale near-sqrt to the safety gap between dirty pages and threshold.
> 
> The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
> pages at exactly the same time, each task will be assigned a large
> initial nr_dirtied_pause, so that the dirty threshold will be exceeded
> long before each task reached its nr_dirtied_pause and hence call
> balance_dirty_pages().
> 
> The solution is to watch for the number of pages dirtied on each CPU in
> between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
> (3% dirty threshold), force call balance_dirty_pages() for a chance to
> set bdi->dirty_exceeded. In normal situations, this safeguarding
> condition is not expected to trigger at all.
> 
> peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/sched.h |    7 +++
>  kernel/fork.c         |    3 +
>  mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
>  3 files changed, 61 insertions(+), 39 deletions(-)
> 
> --- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
> @@ -1525,6 +1525,13 @@ struct task_struct {
>  	int make_it_fail;
>  #endif
>  	struct prop_local_single dirties;
> +	/*
> +	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
> +	 * balance_dirty_pages() for some dirty throttling pause
> +	 */
> +	int nr_dirtied;
> +	int nr_dirtied_pause;
> +
>  #ifdef CONFIG_LATENCYTOP
>  	int latency_record_count;
>  	struct latency_record latency_record[LT_SAVECOUNT];
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
> @@ -54,20 +54,6 @@
>   */
>  static long ratelimit_pages = 32;
>  
> -/*
> - * When balance_dirty_pages decides that the caller needs to perform some
> - * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than dirtied pages to ensure that reasonably
> - * large amounts of I/O are submitted.
> - */
> -static inline long sync_writeback_pages(unsigned long dirtied)
> -{
> -	if (dirtied < ratelimit_pages)
> -		dirtied = ratelimit_pages;
> -
> -	return dirtied + dirtied / 2;
> -}
> -
>  /* The following parameters are exported via /proc/sys/vm */
>  
>  /*
> @@ -169,6 +155,8 @@ static void update_completion_period(voi
>  	int shift = calc_period_shift();
>  	prop_change_shift(&vm_completions, shift);
>  	prop_change_shift(&vm_dirties, shift);
> +
> +	writeback_set_ratelimit();
>  }
>  
>  int dirty_background_ratio_handler(struct ctl_table *table, int write,
> @@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
>  }
>  
>  /*
> + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> + * will look to see if it needs to start dirty throttling.
> + *
> + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> + * global_page_state() too often. So scale it near-sqrt to the safety margin
> + * (the number of pages we may dirty without exceeding the dirty limits).
> + */
> +static unsigned long dirty_poll_interval(unsigned long dirty,
> +					 unsigned long thresh)
> +{
> +	if (thresh > dirty)
> +		return 1UL << (ilog2(thresh - dirty) >> 1);
> +
> +	return 1;
> +}
> +
> +/*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> @@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
>  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> +	current->nr_dirtied = 0;
> +	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
> +
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
>  	}
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> +static DEFINE_PER_CPU(int, bdp_ratelimits);
>  
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> @@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
>  					unsigned long nr_pages_dirtied)
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> -	unsigned long ratelimit;
> -	unsigned long *p;
> +	int ratelimit;
> +	int *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>  
> -	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> -		ratelimit = 8;
> +	if (!bdi->dirty_exceeded)
> +		ratelimit = current->nr_dirtied_pause;
> +	else
> +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Usage of ratelimit before init?

Maybe:

	ratelimit = current->nr_dirtied_pause;
	if (bdi->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Thanks,
-Andrea

> +
> +	current->nr_dirtied += nr_pages_dirtied;
>  
> +	preempt_disable();
>  	/*
> -	 * Check the rate limiting. Also, we do not want to throttle real-time
> -	 * tasks in balance_dirty_pages(). Period.
> +	 * This prevents one CPU to accumulate too many dirtied pages without
> +	 * calling into balance_dirty_pages(), which can happen when there are
> +	 * 1000+ tasks, all of them start dirtying pages at exactly the same
> +	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>  	 */
> -	preempt_disable();
>  	p =  &__get_cpu_var(bdp_ratelimits);
> -	*p += nr_pages_dirtied;
> -	if (unlikely(*p >= ratelimit)) {
> -		ratelimit = sync_writeback_pages(*p);
> +	if (unlikely(current->nr_dirtied >= ratelimit))
>  		*p = 0;
> -		preempt_enable();
> -		balance_dirty_pages(mapping, ratelimit);
> -		return;
> +	else {
> +		*p += nr_pages_dirtied;
> +		if (unlikely(*p >= ratelimit_pages)) {
> +			*p = 0;
> +			ratelimit = 0;
> +		}
>  	}
>  	preempt_enable();
> +
> +	if (unlikely(current->nr_dirtied >= ratelimit))
> +		balance_dirty_pages(mapping, current->nr_dirtied);
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>  
> @@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
>   *
>   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> - * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> + * thresholds.
>   */
>  
>  void writeback_set_ratelimit(void)
>  {
> -	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
> +	unsigned long background_thresh;
> +	unsigned long dirty_thresh;
> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> +	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
>  	if (ratelimit_pages < 16)
>  		ratelimit_pages = 16;
> -	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
>  }
>  
>  static int __cpuinit
> --- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
> @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
>  	p->pdeath_signal = 0;
>  	p->exit_state = 0;
>  
> +	p->nr_dirtied = 0;
> +	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
> +
>  	/*
>  	 * Ok, make it visible to the rest of the system.
>  	 * We dont wake it up yet.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  7:17     ` Andrea Righi
@ 2011-08-16  7:22       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  7:22 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

> > +	if (!bdi->dirty_exceeded)
> > +		ratelimit = current->nr_dirtied_pause;
> > +	else
> > +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> 
> Usage of ratelimit before init?
> 
> Maybe:
> 
> 	ratelimit = current->nr_dirtied_pause;
> 	if (bdi->dirty_exceeded)
> 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Good catch, thanks! That's indeed the original form. I changed it to
make the code more aligned...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-16  7:22       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-16  7:22 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

> > +	if (!bdi->dirty_exceeded)
> > +		ratelimit = current->nr_dirtied_pause;
> > +	else
> > +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> 
> Usage of ratelimit before init?
> 
> Maybe:
> 
> 	ratelimit = current->nr_dirtied_pause;
> 	if (bdi->dirty_exceeded)
> 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Good catch, thanks! That's indeed the original form. I changed it to
make the code more aligned...

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16 19:41     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-16 19:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hello Fengguang,

  this patch is much easier to read than in older versions! Good work!

> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
  I think you can slightly simplify this to:
(thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;


> +	x_intercept = setpoint + 2 * span;
  What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
easily 500 MB, that happens quite often I imagine?

> +
> +	if (unlikely(bdi_dirty > setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
  Shouldn't this be bdi_thresh instead of limit? I understand this is a
hard limit but with more bdis this condition is rather weak and almost
never true.

> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			setpoint += span;
> +			pos_ratio >>= 1;
> +		}
  And here you stretch the control area upto the global dirty limit. I
understand you maybe don't want to be really strict and cut control area at
bdi_thresh but your choice looks like too benevolent - when you have
several active bdi's with different speeds this will effectively erase
difference between them, won't it? E.g. with 10 bdi's (x_intercept -
bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
bdi_dirty really heavily exceeds bdi_thresh. So wouldn't it be better to
just make sure control area is reasonably large (e.g. at least 16 MB) to
allow BDI to ramp up it's bdi_thresh but don't extend it upto global dirty
limit?

> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-16 19:41     ` Jan Kara
  0 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-16 19:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hello Fengguang,

  this patch is much easier to read than in older versions! Good work!

> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
  I think you can slightly simplify this to:
(thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;


> +	x_intercept = setpoint + 2 * span;
  What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
easily 500 MB, that happens quite often I imagine?

> +
> +	if (unlikely(bdi_dirty > setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
  Shouldn't this be bdi_thresh instead of limit? I understand this is a
hard limit but with more bdis this condition is rather weak and almost
never true.

> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			setpoint += span;
> +			pos_ratio >>= 1;
> +		}
  And here you stretch the control area upto the global dirty limit. I
understand you maybe don't want to be really strict and cut control area at
bdi_thresh but your choice looks like too benevolent - when you have
several active bdi's with different speeds this will effectively erase
difference between them, won't it? E.g. with 10 bdi's (x_intercept -
bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
bdi_dirty really heavily exceeds bdi_thresh. So wouldn't it be better to
just make sure control area is reasonably large (e.g. at least 16 MB) to
allow BDI to ramp up it's bdi_thresh but don't extend it upto global dirty
limit?

> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16 19:41     ` Jan Kara
  (?)
@ 2011-08-17 13:23     ` Wu Fengguang
  2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  -1 siblings, 2 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

Hi Jan,

On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
>   this patch is much easier to read than in older versions! Good work!

Thank you :)

> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
>   I think you can slightly simplify this to:
> (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;

Good idea!

> > +	x_intercept = setpoint + 2 * span;
>   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> easily 500 MB, that happens quite often I imagine?

That's fine because I no longer target "bdi_thresh" as some limiting
factor as the global "thresh". Due to it being unstable in small
memory JBOD systems, which is the big and unique problem in JBOD.

> > +
> > +	if (unlikely(bdi_dirty > setpoint + span)) {
> > +		if (unlikely(bdi_dirty > limit))
> > +			return 0;
>   Shouldn't this be bdi_thresh instead of limit? I understand this is a
> hard limit but with more bdis this condition is rather weak and almost
> never true.

Yeah, I mean @limit. @bdi_thresh is made weak in IO-less
balance_dirty_pages() in order to get reasonable smooth dirty rate in
the face of a fluctuating @bdi_thresh.

The tradeoff is to let bdi dirty pages fluctuate more or less freely,
as long as they don't drop low and risk IO queue underflow. The
attached patch tries to prevent the underflow (which is good but not
perfect).

> > +		if (x_intercept < limit) {
> > +			x_intercept = limit;	/* auxiliary control line */
> > +			setpoint += span;
> > +			pos_ratio >>= 1;
> > +		}
>   And here you stretch the control area upto the global dirty limit. I
> understand you maybe don't want to be really strict and cut control area at
> bdi_thresh but your choice looks like too benevolent - when you have
> several active bdi's with different speeds this will effectively erase
> difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> bdi_dirty really heavily exceeds bdi_thresh.

Yes the auxiliary control line could be very flat (small slope).

However it's not normal for the bdi dirty pages to fall into the
range of auxiliary control line at all. And once it takes effect, 
the pos_ratio is at most 0.5 (which is the value at the connection
point with the main bdi control line) which is strong enough to pull
the dirty pages off the auxiliary bdi control line and into the scope
of main bdi control line.

The auxiliary control line is intended for bringing down the bdi_dirty
of the USB key before 250s (where the "pos bandwidth" line keeps low): 

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png

After that the bdi_dirty will fluctuate around bdi_thresh and won't
grow high and step into the scope of the auxiliary control line.

> So wouldn't it be better to
> just make sure control area is reasonably large (e.g. at least 16 MB) to
> allow BDI to ramp up it's bdi_thresh but don't extend it upto global dirty
> limit?

In order to take bdi_thresh as some semi-strict limit, we need to make
it very stable at first..otherwise the whole control system may fluctuate
violently.

Thanks,
Fengguang

> > +	}
> > +	pos_ratio *= x_intercept - bdi_dirty;
> > +	do_div(pos_ratio, x_intercept - setpoint + 1);
> > +
> > +	return pos_ratio;
> > +}
> > +
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

[-- Attachment #2: bdi-reserve-area --]
[-- Type: text/plain, Size: 2539 bytes --]

Subject: writeback: dirty position control - bdi reserve area
Date: Thu Aug 04 22:16:46 CST 2011

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small memory system.

XXX:
When memory is small (in comparison to write bandwidth), this control
line may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended because the bdi is in the
danger of IO queue underflow. However the global dirty pages, when
pushed close to limit, will eventually conteract our desire to push up
the low bdi_dirty. In low memory JBOD tests we do see disks
under-utilized from time to time.

One scheme that may completely fix this is to add a BDI_queue_empty to
indicate the block IO queue emptiness (but still there may be in flight
IOs on the driver/hardware side) and to unthrottle the tasks regardless
of the global limit on seeing BDI_queue_empty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-16 09:06:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 09:06:50.000000000 +0800
@@ -488,6 +488,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -571,6 +581,19 @@ static unsigned long bdi_position_ratio(
 	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
 
 	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 */
+	x_intercept = min(bdi->avg_write_bandwidth + 2 * MIN_WRITEBACK_PAGES,
+			  freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
+	/*
 	 * bdi setpoint
 	 *
 	 *        f(dirty) := 1.0 + k * (dirty - setpoint)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > +		if (x_intercept < limit) {
> > > +			x_intercept = limit;	/* auxiliary control line */
> > > +			setpoint += span;
> > > +			pos_ratio >>= 1;
> > > +		}
> >   And here you stretch the control area upto the global dirty limit. I
> > understand you maybe don't want to be really strict and cut control area at
> > bdi_thresh but your choice looks like too benevolent - when you have
> > several active bdi's with different speeds this will effectively erase
> > difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> > bdi_dirty really heavily exceeds bdi_thresh.
> 
> Yes the auxiliary control line could be very flat (small slope).
> 
> However it's not normal for the bdi dirty pages to fall into the
> range of auxiliary control line at all. And once it takes effect, 
> the pos_ratio is at most 0.5 (which is the value at the connection
> point with the main bdi control line) which is strong enough to pull
> the dirty pages off the auxiliary bdi control line and into the scope
> of main bdi control line.
> 
> The auxiliary control line is intended for bringing down the bdi_dirty
> of the USB key before 250s (where the "pos bandwidth" line keeps low): 
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png
> 
> After that the bdi_dirty will fluctuate around bdi_thresh and won't
> grow high and step into the scope of the auxiliary control line.

Note that the main/auxiliary bdi control lines won't take effect at
the same time: the main bdi control lines works around and under the
bdi setpoint, and the auxiliary line takes over in the higher scope up
to @limit.

In the 1UKEY+1HDD test case, the bdi_dirty of the UKEY rushes at the
free run stage when global dirty pages are smaller than (thresh+bg_thresh)/2.

So it will be initially under the control the auxiliary line. Hence the
dirtier task will progress at 1/4 to 1/2 of the UKEY's write bandwidth. 
This will bring down the bdi_dirty reasonably fast while still allowing
the dirtier task to make some progress.

The connection point of the main/auxiliary control lines has pos_ratio=0.5.

After 250 second, the main bdi control line takes over, indicated by
the bdi_dirty fluctuating around bdi setpoint and the position rate
(green line) fluctuating around the base ratelimit(blue line).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-17 13:49         ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > +		if (x_intercept < limit) {
> > > +			x_intercept = limit;	/* auxiliary control line */
> > > +			setpoint += span;
> > > +			pos_ratio >>= 1;
> > > +		}
> >   And here you stretch the control area upto the global dirty limit. I
> > understand you maybe don't want to be really strict and cut control area at
> > bdi_thresh but your choice looks like too benevolent - when you have
> > several active bdi's with different speeds this will effectively erase
> > difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> > bdi_dirty really heavily exceeds bdi_thresh.
> 
> Yes the auxiliary control line could be very flat (small slope).
> 
> However it's not normal for the bdi dirty pages to fall into the
> range of auxiliary control line at all. And once it takes effect, 
> the pos_ratio is at most 0.5 (which is the value at the connection
> point with the main bdi control line) which is strong enough to pull
> the dirty pages off the auxiliary bdi control line and into the scope
> of main bdi control line.
> 
> The auxiliary control line is intended for bringing down the bdi_dirty
> of the USB key before 250s (where the "pos bandwidth" line keeps low): 
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png
> 
> After that the bdi_dirty will fluctuate around bdi_thresh and won't
> grow high and step into the scope of the auxiliary control line.

Note that the main/auxiliary bdi control lines won't take effect at
the same time: the main bdi control lines works around and under the
bdi setpoint, and the auxiliary line takes over in the higher scope up
to @limit.

In the 1UKEY+1HDD test case, the bdi_dirty of the UKEY rushes at the
free run stage when global dirty pages are smaller than (thresh+bg_thresh)/2.

So it will be initially under the control the auxiliary line. Hence the
dirtier task will progress at 1/4 to 1/2 of the UKEY's write bandwidth. 
This will bring down the bdi_dirty reasonably fast while still allowing
the dirtier task to make some progress.

The connection point of the main/auxiliary control lines has pos_ratio=0.5.

After 250 second, the main bdi control line takes over, indicated by
the bdi_dirty fluctuating around bdi setpoint and the position rate
(green line) fluctuating around the base ratelimit(blue line).

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 20:24         ` Jan Kara
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-17 20:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hi Fengguang,

On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > +					unsigned long thresh,
> > > +					unsigned long bg_thresh,
> > > +					unsigned long dirty,
> > > +					unsigned long bdi_thresh,
> > > +					unsigned long bdi_dirty)
> > > +{
> > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > +	unsigned long x_intercept;
> > > +	unsigned long setpoint;		/* the target balance point */
> > > +	unsigned long span;
> > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > +	long x;
> > > +
> > > +	if (unlikely(dirty >= limit))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * global setpoint
> > > +	 *
> > > +	 *                         setpoint - dirty 3
> > > +	 *        f(dirty) := 1 + (----------------)
> > > +	 *                         limit - setpoint
> > > +	 *
> > > +	 * it's a 3rd order polynomial that subjects to
> > > +	 *
> > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > +	 * (3) f(limit)    = 0   => the hard limit
> > > +	 * (4) df/dx       < 0	 => negative feedback control
                          ^^^ Strictly speaking this is <= 0

> > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > +	 */
> > > +	setpoint = (freerun + limit) / 2;
> > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > +		    limit - setpoint + 1);
> > > +	pos_ratio = x;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > +
> > > +	/*
> > > +	 * bdi setpoint
  OK, so if I understand the code right, we now have basic pos_ratio based
on global situation. Now, in the following code, we might scale pos_ratio
further down, if bdi_dirty is too much over bdi's share, right? Do we also
want to scale pos_ratio up, if we are under bdi's share? If yes, do we
really want to do it even if global pos_ratio < 1 (i.e. we are over global
setpoint)?

Maybe we could update the comment with something like:
 * We have computed basic pos_ratio above based on global situation. If the
 * bdi is over its share of dirty pages, we want to scale pos_ratio further
 * down. That is done by the following mechanism:
and now describe how updating works.

> > > +	 *
> > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
                  ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
bdi_setpoint to distinguish clearly from the global value.

> > > +	 *
> > > +	 * The main bdi control line is a linear function that subjects to
> > > +	 *
> > > +	 * (1) f(setpoint) = 1.0
> > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > +	 *
> > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > +	 * regularly within range
> > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > +	 * fluctuation range for pos_ratio.
> > > +	 *
> > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > +	 * own size, so move the slope over accordingly.
> > > +	 */
> > > +	if (unlikely(bdi_thresh > thresh))
> > > +		bdi_thresh = thresh;
> > > +	/*
> > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > +	 */
> > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > +	setpoint = setpoint * (u64)x >> 16;
> > > +	/*
> > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > +	 */
> > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > +		       thresh + 1);
> >   I think you can slightly simplify this to:
> > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> 
> Good idea!
> 
> > > +	x_intercept = setpoint + 2 * span;
   ^^ BTW, why do you have 2*span here? It can result in x_intercept being
~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
span?

> >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > easily 500 MB, that happens quite often I imagine?
> 
> That's fine because I no longer target "bdi_thresh" as some limiting
> factor as the global "thresh". Due to it being unstable in small
> memory JBOD systems, which is the big and unique problem in JBOD.
  I see. Given the control mechanism below, I think we can try this idea
and see whether it makes problems in practice or not. But the fact that
bdi_thresh is no longer treated as limit should be noted in a changelog -
probably of the last patch (although that is already too long for my taste
so I'll look into how we could make it shorter so that average developer
has enough patience to read it ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-17 20:24         ` Jan Kara
  0 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-17 20:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hi Fengguang,

On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > +					unsigned long thresh,
> > > +					unsigned long bg_thresh,
> > > +					unsigned long dirty,
> > > +					unsigned long bdi_thresh,
> > > +					unsigned long bdi_dirty)
> > > +{
> > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > +	unsigned long x_intercept;
> > > +	unsigned long setpoint;		/* the target balance point */
> > > +	unsigned long span;
> > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > +	long x;
> > > +
> > > +	if (unlikely(dirty >= limit))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * global setpoint
> > > +	 *
> > > +	 *                         setpoint - dirty 3
> > > +	 *        f(dirty) := 1 + (----------------)
> > > +	 *                         limit - setpoint
> > > +	 *
> > > +	 * it's a 3rd order polynomial that subjects to
> > > +	 *
> > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > +	 * (3) f(limit)    = 0   => the hard limit
> > > +	 * (4) df/dx       < 0	 => negative feedback control
                          ^^^ Strictly speaking this is <= 0

> > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > +	 */
> > > +	setpoint = (freerun + limit) / 2;
> > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > +		    limit - setpoint + 1);
> > > +	pos_ratio = x;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > +
> > > +	/*
> > > +	 * bdi setpoint
  OK, so if I understand the code right, we now have basic pos_ratio based
on global situation. Now, in the following code, we might scale pos_ratio
further down, if bdi_dirty is too much over bdi's share, right? Do we also
want to scale pos_ratio up, if we are under bdi's share? If yes, do we
really want to do it even if global pos_ratio < 1 (i.e. we are over global
setpoint)?

Maybe we could update the comment with something like:
 * We have computed basic pos_ratio above based on global situation. If the
 * bdi is over its share of dirty pages, we want to scale pos_ratio further
 * down. That is done by the following mechanism:
and now describe how updating works.

> > > +	 *
> > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
                  ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
bdi_setpoint to distinguish clearly from the global value.

> > > +	 *
> > > +	 * The main bdi control line is a linear function that subjects to
> > > +	 *
> > > +	 * (1) f(setpoint) = 1.0
> > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > +	 *
> > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > +	 * regularly within range
> > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > +	 * fluctuation range for pos_ratio.
> > > +	 *
> > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > +	 * own size, so move the slope over accordingly.
> > > +	 */
> > > +	if (unlikely(bdi_thresh > thresh))
> > > +		bdi_thresh = thresh;
> > > +	/*
> > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > +	 */
> > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > +	setpoint = setpoint * (u64)x >> 16;
> > > +	/*
> > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > +	 */
> > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > +		       thresh + 1);
> >   I think you can slightly simplify this to:
> > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> 
> Good idea!
> 
> > > +	x_intercept = setpoint + 2 * span;
   ^^ BTW, why do you have 2*span here? It can result in x_intercept being
~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
span?

> >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > easily 500 MB, that happens quite often I imagine?
> 
> That's fine because I no longer target "bdi_thresh" as some limiting
> factor as the global "thresh". Due to it being unstable in small
> memory JBOD systems, which is the big and unique problem in JBOD.
  I see. Given the control mechanism below, I think we can try this idea
and see whether it makes problems in practice or not. But the fact that
bdi_thresh is no longer treated as limit should be noted in a changelog -
probably of the last patch (although that is already too long for my taste
so I'll look into how we could make it shorter so that average developer
has enough patience to read it ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-18  4:18           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to make it consistent
all over the places.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that at some configurations bdi_thresh can fluctuate to its own
size, I guess the current slope of control line is sharp enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means, when bdi_dirty deviates bdi_thresh apart, pos_ratio and
hence task ratelimit will fluctuate by -1/2. This is probably more
than the users can tolerate already?

btw. the connection point of main/auxiliary control lines are at

        (x_intercept + bdi_setpoint) / 2 

as shown in the graph of the below updated patch.

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = bdi_setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18  4:18           ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to make it consistent
all over the places.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that at some configurations bdi_thresh can fluctuate to its own
size, I guess the current slope of control line is sharp enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means, when bdi_dirty deviates bdi_thresh apart, pos_ratio and
hence task ratelimit will fluctuate by -1/2. This is probably more
than the users can tolerate already?

btw. the connection point of main/auxiliary control lines are at

        (x_intercept + bdi_setpoint) / 2 

as shown in the graph of the below updated patch.

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = bdi_setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18  4:41             ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Hi Jan,

> > > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > > easily 500 MB, that happens quite often I imagine?
> > > 
> > > That's fine because I no longer target "bdi_thresh" as some limiting
> > > factor as the global "thresh". Due to it being unstable in small
> > > memory JBOD systems, which is the big and unique problem in JBOD.
> >   I see. Given the control mechanism below, I think we can try this idea
> > and see whether it makes problems in practice or not. But the fact that
> > bdi_thresh is no longer treated as limit should be noted in a changelog -
> > probably of the last patch (although that is already too long for my taste
> > so I'll look into how we could make it shorter so that average developer
> > has enough patience to read it ;).
> 
> Good point. I'll make it a comment in the last patch.

Just added this comment:

+               /*
+                * bdi_thresh is not treated as some limiting factor as
+                * dirty_thresh, due to reasons
+                * - in JBOD setup, bdi_thresh can fluctuate a lot
+                * - in a system with HDD and USB key, the USB key may somehow
+                *   go into state (bdi_dirty >> bdi_thresh) either because
+                *   bdi_dirty starts high, or because bdi_thresh drops low.
+                *   In this case we don't want to hard throttle the USB key
+                *   dirtiers for 100 seconds until bdi_dirty drops under
+                *   bdi_thresh. Instead the auxiliary bdi control line in
+                *   bdi_position_ratio() will let the dirtier task progress
+                *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+                */
                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18  4:41             ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Hi Jan,

> > > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > > easily 500 MB, that happens quite often I imagine?
> > > 
> > > That's fine because I no longer target "bdi_thresh" as some limiting
> > > factor as the global "thresh". Due to it being unstable in small
> > > memory JBOD systems, which is the big and unique problem in JBOD.
> >   I see. Given the control mechanism below, I think we can try this idea
> > and see whether it makes problems in practice or not. But the fact that
> > bdi_thresh is no longer treated as limit should be noted in a changelog -
> > probably of the last patch (although that is already too long for my taste
> > so I'll look into how we could make it shorter so that average developer
> > has enough patience to read it ;).
> 
> Good point. I'll make it a comment in the last patch.

Just added this comment:

+               /*
+                * bdi_thresh is not treated as some limiting factor as
+                * dirty_thresh, due to reasons
+                * - in JBOD setup, bdi_thresh can fluctuate a lot
+                * - in a system with HDD and USB key, the USB key may somehow
+                *   go into state (bdi_dirty >> bdi_thresh) either because
+                *   bdi_dirty starts high, or because bdi_thresh drops low.
+                *   In this case we don't want to hard throttle the USB key
+                *   dirtiers for 100 seconds until bdi_dirty drops under
+                *   bdi_thresh. Instead the auxiliary bdi control line in
+                *   bdi_position_ratio() will let the dirtier task progress
+                *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+                */
                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18 19:16             ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that subjects to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that at some configurations bdi_thresh can fluctuate to its own
> size, I guess the current slope of control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty deviates bdi_thresh apart, pos_ratio and
> hence task ratelimit will fluctuate by -1/2. This is probably more
> than the users can tolerate already?
  OK, let's try that.

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield in a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18 19:16             ` Jan Kara
  0 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that subjects to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that at some configurations bdi_thresh can fluctuate to its own
> size, I guess the current slope of control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty deviates bdi_thresh apart, pos_ratio and
> hence task ratelimit will fluctuate by -1/2. This is probably more
> than the users can tolerate already?
  OK, let's try that.

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield in a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:06     ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:11AM +0800, Wu Fengguang wrote:

[..]
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>  				     nr_dirty, bdi_thresh, bdi_dirty,
>  				     start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_rate = bdi->dirty_ratelimit;
> +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> +					       background_thresh, nr_dirty,
> +					       bdi_thresh, bdi_dirty);
> +		if (unlikely(pos_ratio == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		task_ratelimit = (u64)base_rate *
> +					pos_ratio >> RATELIMIT_CALC_SHIFT;

Hi Fenguaang,

I am little confused here. I see that you have already taken pos_ratio
into account in bdi_update_dirty_ratelimit() and wondering why to take
that into account again in balance_diry_pages().

We calculated the pos_rate and balanced_rate and adjusted the
bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

So why are we adjusting this pos_ratio() adjusted limit again with
pos_ratio(). Doesn't it become effectively following (assuming
one is decreasing the dirty rate limit).

base_rate = bdi->dirty_ratelimit
pos_rate = base_rate * pos_ratio();

			  write_bw
balance_rate = pos_rate * --------
			  dirty_bw

delta = max(pos_rate, balance_rate)
bdi->dirty_ratelimit = bdi->dirty_ratelimit - delta;

task_ratelimit = bdi->dirty_ratelimit * pos_ratio().

So we have already taken into account pos_ratio() while calculating new
bdi->dirty_ratelimit. Do we need to take that into account again.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-19  2:06     ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:11AM +0800, Wu Fengguang wrote:

[..]
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>  				     nr_dirty, bdi_thresh, bdi_dirty,
>  				     start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_rate = bdi->dirty_ratelimit;
> +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> +					       background_thresh, nr_dirty,
> +					       bdi_thresh, bdi_dirty);
> +		if (unlikely(pos_ratio == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		task_ratelimit = (u64)base_rate *
> +					pos_ratio >> RATELIMIT_CALC_SHIFT;

Hi Fenguaang,

I am little confused here. I see that you have already taken pos_ratio
into account in bdi_update_dirty_ratelimit() and wondering why to take
that into account again in balance_diry_pages().

We calculated the pos_rate and balanced_rate and adjusted the
bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

So why are we adjusting this pos_ratio() adjusted limit again with
pos_ratio(). Doesn't it become effectively following (assuming
one is decreasing the dirty rate limit).

base_rate = bdi->dirty_ratelimit
pos_rate = base_rate * pos_ratio();

			  write_bw
balance_rate = pos_rate * --------
			  dirty_bw

delta = max(pos_rate, balance_rate)
bdi->dirty_ratelimit = bdi->dirty_ratelimit - delta;

task_ratelimit = bdi->dirty_ratelimit * pos_ratio().

So we have already taken into account pos_ratio() while calculating new
bdi->dirty_ratelimit. Do we need to take that into account again.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:53     ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

Few very basic queries.

- Why can't we use the same formula for bdi position ratio as gloabl
  position ratio. Are you not looking for similar proporties. Near the
  set point variation is less and away from setup poing throttling is
  faster.

- In the bdi calculation, setpoint seems to be in number of pages and 
  limit (x_intercept) seems to be a combination of nr pages + pages/sec.
  Why it is different from gloabl setpoint and limit. I mean could this
  not have been like global calculation where we try to keep bdi_dirty
  close to bdi_thresh and calculate pos_ratio. 

- In global pos_ratio calculation terminology used is "limit" while
  the same thing seems be being meintioned as x_intercept in bdi position
  ratio calculation.

Am I missing something very basic here.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-19  2:53     ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

Few very basic queries.

- Why can't we use the same formula for bdi position ratio as gloabl
  position ratio. Are you not looking for similar proporties. Near the
  set point variation is less and away from setup poing throttling is
  faster.

- In the bdi calculation, setpoint seems to be in number of pages and 
  limit (x_intercept) seems to be a combination of nr pages + pages/sec.
  Why it is different from gloabl setpoint and limit. I mean could this
  not have been like global calculation where we try to keep bdi_dirty
  close to bdi_thresh and calculate pos_ratio. 

- In global pos_ratio calculation terminology used is "limit" while
  the same thing seems be being meintioned as x_intercept in bdi position
  ratio calculation.

Am I missing something very basic here.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:06     ` Vivek Goyal
@ 2011-08-19  2:54       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-19  2:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

> > +		base_rate = bdi->dirty_ratelimit;
> > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > +					       background_thresh, nr_dirty,
> > +					       bdi_thresh, bdi_dirty);
> > +		if (unlikely(pos_ratio == 0)) {
> > +			pause = MAX_PAUSE;
> > +			goto pause;
> >  		}
> > +		task_ratelimit = (u64)base_rate *
> > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> 
> Hi Fenguaang,
> 
> I am little confused here. I see that you have already taken pos_ratio
> into account in bdi_update_dirty_ratelimit() and wondering why to take
> that into account again in balance_diry_pages().
> 
> We calculated the pos_rate and balanced_rate and adjusted the
> bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

Good question. There are some inter-dependencies in the calculation,
and the dependency chain is the opposite to the one in your mind:
balance_dirty_pages() used pos_ratio in the first place, so that
bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
of the balanced dirty rate, too.

Let's return to how the balanced dirty rate is estimated. Please pay
special attention to the last paragraphs below the "......" line.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                               (1)
                         (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

        dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0                              (2)
Or     
        task_ratelimit_0 = dirty_rate / N                               (3)

Now we conclude that the balanced task ratelimit can be estimated by

        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)

Because with (2) and (3), (4) yields the desired equality (1):

        balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
                      == write_bw / N

.............................................................................

Now let's revisit (1). Since balance_dirty_pages() chooses to execute
the ratelimit

        task_ratelimit = task_ratelimit_0
                       = dirty_ratelimit * pos_ratio                    (5)

Put (5) into (4), we get the final form used in
bdi_update_dirty_ratelimit()

        balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)

So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-19  2:54       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-19  2:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

> > +		base_rate = bdi->dirty_ratelimit;
> > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > +					       background_thresh, nr_dirty,
> > +					       bdi_thresh, bdi_dirty);
> > +		if (unlikely(pos_ratio == 0)) {
> > +			pause = MAX_PAUSE;
> > +			goto pause;
> >  		}
> > +		task_ratelimit = (u64)base_rate *
> > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> 
> Hi Fenguaang,
> 
> I am little confused here. I see that you have already taken pos_ratio
> into account in bdi_update_dirty_ratelimit() and wondering why to take
> that into account again in balance_diry_pages().
> 
> We calculated the pos_rate and balanced_rate and adjusted the
> bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

Good question. There are some inter-dependencies in the calculation,
and the dependency chain is the opposite to the one in your mind:
balance_dirty_pages() used pos_ratio in the first place, so that
bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
of the balanced dirty rate, too.

Let's return to how the balanced dirty rate is estimated. Please pay
special attention to the last paragraphs below the "......" line.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                               (1)
                         (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

        dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0                              (2)
Or     
        task_ratelimit_0 = dirty_rate / N                               (3)

Now we conclude that the balanced task ratelimit can be estimated by

        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)

Because with (2) and (3), (4) yields the desired equality (1):

        balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
                      == write_bw / N

.............................................................................

Now let's revisit (1). Since balance_dirty_pages() chooses to execute
the ratelimit

        task_ratelimit = task_ratelimit_0
                       = dirty_ratelimit * pos_ratio                    (5)

Put (5) into (4), we get the final form used in
bdi_update_dirty_ratelimit()

        balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)

So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-19  2:53     ` Vivek Goyal
@ 2011-08-19  3:25       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-19  3:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:53:21AM +0800, Vivek Goyal wrote:
> On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:
> 
> [..]
> > +/*
> > + * Dirty position control.
> > + *
> > + * (o) global/bdi setpoints
> > + *
> > + * We want the dirty pages be balanced around the global/bdi setpoints.
> > + * When the number of dirty pages is higher/lower than the setpoint, the
> > + * dirty position control ratio (and hence task dirty ratelimit) will be
> > + * decreased/increased to bring the dirty pages back to the setpoint.
> > + *
> > + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> > + *
> > + *     if (dirty < setpoint) scale up   pos_ratio
> > + *     if (dirty > setpoint) scale down pos_ratio
> > + *
> > + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> > + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> > + *
> > + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> > + *
> > + * (o) global control line
> > + *
> > + *     ^ pos_ratio
> > + *     |
> > + *     |            |<===== global dirty control scope ======>|
> > + * 2.0 .............*
> > + *     |            .*
> > + *     |            . *
> > + *     |            .   *
> > + *     |            .     *
> > + *     |            .        *
> > + *     |            .            *
> > + * 1.0 ................................*
> > + *     |            .                  .     *
> > + *     |            .                  .          *
> > + *     |            .                  .              *
> > + *     |            .                  .                 *
> > + *     |            .                  .                    *
> > + *   0 +------------.------------------.----------------------*------------->
> > + *           freerun^          setpoint^                 limit^   dirty pages
> > + *
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0                 setpoint                     x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> > + */
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
> > +	x_intercept = setpoint + 2 * span;
> > +
> 
> Hi Fengguang,
> 
> Few very basic queries.
> 
> - Why can't we use the same formula for bdi position ratio as gloabl
>   position ratio. Are you not looking for similar proporties. Near the
>   set point variation is less and away from setup poing throttling is
>   faster.

The changelog has more details, however I hope the rephrased summary
can answer this question better.

Firstly, for single bdi case, the different bdi/global formula is
complementing each other, where the bdi's slope is proportional to the
writeout bandwidth, while the global one is scaling to memory size.
In huge memory system, the global position feedback becomes very weak
(even far away from the setpoint).  This is where the bdi control line
can help pull the dirty pages to the setpoint.

Secondly, for JBOD case, the global/bdi dirty thresholds are
fundamentally different. The global one is stable and strong limit,
while the bdi one is fluctuating and hence only suitable be taken as a
weak limit. The other reason to make it a weak limit is, there are
valid situations that (bdi_dirty >> bdi_thresh) and it's desirable to
throttle the dirtier in reasonable small rate rather than to hard
throttle it.

> - In the bdi calculation, setpoint seems to be in number of pages and 
>   limit (x_intercept) seems to be a combination of nr pages + pages/sec.
>   Why it is different from gloabl setpoint and limit. I mean could this
>   not have been like global calculation where we try to keep bdi_dirty
>   close to bdi_thresh and calculate pos_ratio. 

Because the bdi dirty pages are observed to typically fluctuate up to
1-second worth of data. So the write_bw used here is really (1s * write_bw).

> - In global pos_ratio calculation terminology used is "limit" while
>   the same thing seems be being meintioned as x_intercept in bdi position
>   ratio calculation.

Yes. Because the bdi control lines don't intent to do hard limit at all.

It's actually possible for x_intercept to become larger than the global limit.
This means the it's a memory tight system (or the storage is super fast)
where the bdi dirty pages will inevitably fluctuate a lot (up to write_bw).
We just let go of them and let the global formula take the control.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-19  3:25       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-19  3:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:53:21AM +0800, Vivek Goyal wrote:
> On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:
> 
> [..]
> > +/*
> > + * Dirty position control.
> > + *
> > + * (o) global/bdi setpoints
> > + *
> > + * We want the dirty pages be balanced around the global/bdi setpoints.
> > + * When the number of dirty pages is higher/lower than the setpoint, the
> > + * dirty position control ratio (and hence task dirty ratelimit) will be
> > + * decreased/increased to bring the dirty pages back to the setpoint.
> > + *
> > + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> > + *
> > + *     if (dirty < setpoint) scale up   pos_ratio
> > + *     if (dirty > setpoint) scale down pos_ratio
> > + *
> > + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> > + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> > + *
> > + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> > + *
> > + * (o) global control line
> > + *
> > + *     ^ pos_ratio
> > + *     |
> > + *     |            |<===== global dirty control scope ======>|
> > + * 2.0 .............*
> > + *     |            .*
> > + *     |            . *
> > + *     |            .   *
> > + *     |            .     *
> > + *     |            .        *
> > + *     |            .            *
> > + * 1.0 ................................*
> > + *     |            .                  .     *
> > + *     |            .                  .          *
> > + *     |            .                  .              *
> > + *     |            .                  .                 *
> > + *     |            .                  .                    *
> > + *   0 +------------.------------------.----------------------*------------->
> > + *           freerun^          setpoint^                 limit^   dirty pages
> > + *
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0                 setpoint                     x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> > + */
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
> > +	x_intercept = setpoint + 2 * span;
> > +
> 
> Hi Fengguang,
> 
> Few very basic queries.
> 
> - Why can't we use the same formula for bdi position ratio as gloabl
>   position ratio. Are you not looking for similar proporties. Near the
>   set point variation is less and away from setup poing throttling is
>   faster.

The changelog has more details, however I hope the rephrased summary
can answer this question better.

Firstly, for single bdi case, the different bdi/global formula is
complementing each other, where the bdi's slope is proportional to the
writeout bandwidth, while the global one is scaling to memory size.
In huge memory system, the global position feedback becomes very weak
(even far away from the setpoint).  This is where the bdi control line
can help pull the dirty pages to the setpoint.

Secondly, for JBOD case, the global/bdi dirty thresholds are
fundamentally different. The global one is stable and strong limit,
while the bdi one is fluctuating and hence only suitable be taken as a
weak limit. The other reason to make it a weak limit is, there are
valid situations that (bdi_dirty >> bdi_thresh) and it's desirable to
throttle the dirtier in reasonable small rate rather than to hard
throttle it.

> - In the bdi calculation, setpoint seems to be in number of pages and 
>   limit (x_intercept) seems to be a combination of nr pages + pages/sec.
>   Why it is different from gloabl setpoint and limit. I mean could this
>   not have been like global calculation where we try to keep bdi_dirty
>   close to bdi_thresh and calculate pos_ratio. 

Because the bdi dirty pages are observed to typically fluctuate up to
1-second worth of data. So the write_bw used here is really (1s * write_bw).

> - In global pos_ratio calculation terminology used is "limit" while
>   the same thing seems be being meintioned as x_intercept in bdi position
>   ratio calculation.

Yes. Because the bdi control lines don't intent to do hard limit at all.

It's actually possible for x_intercept to become larger than the global limit.
This means the it's a memory tight system (or the storage is super fast)
where the bdi dirty pages will inevitably fluctuate a lot (up to write_bw).
We just let go of them and let the global formula take the control.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:54       ` Wu Fengguang
@ 2011-08-19 19:00         ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19 19:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> Hi Vivek,
> 
> > > +		base_rate = bdi->dirty_ratelimit;
> > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > +					       background_thresh, nr_dirty,
> > > +					       bdi_thresh, bdi_dirty);
> > > +		if (unlikely(pos_ratio == 0)) {
> > > +			pause = MAX_PAUSE;
> > > +			goto pause;
> > >  		}
> > > +		task_ratelimit = (u64)base_rate *
> > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > 
> > Hi Fenguaang,
> > 
> > I am little confused here. I see that you have already taken pos_ratio
> > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > that into account again in balance_diry_pages().
> > 
> > We calculated the pos_rate and balanced_rate and adjusted the
> > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> 
> Good question. There are some inter-dependencies in the calculation,
> and the dependency chain is the opposite to the one in your mind:
> balance_dirty_pages() used pos_ratio in the first place, so that
> bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> of the balanced dirty rate, too.
> 
> Let's return to how the balanced dirty rate is estimated. Please pay
> special attention to the last paragraphs below the "......" line.
> 
> Start by throttling each dd task at rate
> 
>         task_ratelimit = task_ratelimit_0                               (1)
>                          (any non-zero initial value is OK)
> 
> After 200ms, we measured
> 
>         dirty_rate = # of pages dirtied by all dd's / 200ms
>         write_bw   = # of pages written to the disk / 200ms
> 
> For the aggressive dd dirtiers, the equality holds
> 
>         dirty_rate == N * task_rate
>                    == N * task_ratelimit
>                    == N * task_ratelimit_0                              (2)
> Or     
>         task_ratelimit_0 = dirty_rate / N                               (3)
> 
> Now we conclude that the balanced task ratelimit can be estimated by
> 
>         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> 
> Because with (2) and (3), (4) yields the desired equality (1):
> 
>         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
>                       == write_bw / N

Hi Fengguang,

Following is my understanding. Please correct me where I got it wrong.

Ok, I think I follow till this point. I think what you are saying is
that following is our goal in a stable system.

	task_ratelimit = write_bw/N				(6)

So we measure the write_bw of a bdi over a period of time and use that
as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
task_ratelimit and hence we achieve the balance. So we will start with
some arbitrary task limit say task_ratelimit_0, and modify that limit
over a period of time based on our feedback loop to achieve a balanced
system. And following seems to be the formula.
					    write_bw
	task_ratelimit = task_ratelimit_0 * ------- 		(7)
					    dirty_rate

Now I also understand that by using (2) and (3), you proved that
how (7) will lead to (6) and that is our deisred goal. 

> 
> .............................................................................
> 
> Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> the ratelimit
> 
>         task_ratelimit = task_ratelimit_0
>                        = dirty_ratelimit * pos_ratio                    (5)
> 

So balance_drity_pages() chose to take into account pos_ratio() also
because for various reason like just taking into account only bandwidth
variation as feedback was not sufficient. So we also took pos_ratio
into account which in-trun is dependent on gloabal dirty pages and per
bdi dirty_pages/rate.

So we refined the formula for calculating a tasks's effective rate
over a period of time to following.
					    write_bw
	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
					    dirty_rate

Is my understanding right so far?

> Put (5) into (4), we get the final form used in
> bdi_update_dirty_ratelimit()
> 
>         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> 
> So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Now few questions.

- What is dirty_ratelimit in formula above?

- Is it wrong to understand the issue in following manner.

  bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
  and effectively tracks write_bw/N.

  bdi->dirty_ratelimit = write_bw/N

  or 

					    		  write_bw
  bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
					     		  dirty_rate

 Hence a tasks's balanced rate from (9) and (10) is.

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)

So my understanding about (10) and (11) is wrong? if no, then question
comes that bdi->dirty_ratelimit is supposed to be keeping track of 
write bandwidth variations only. And in turn task ratelimit will be
driven by both bandwidth varation as well as pos_ratio variation.

But you seem to be doing following.

 bdi->dirty_ratelimit = adjust based on a cobination of bandwidth feedback
		        and pos_ratio feedback. 

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)

So my question is that when task_ratelimit is finally being adjusted 
based on pos_ratio feedback, why bdi->dirty_ratelimit also needs to
take that into account.

I know you have tried explaining it, but sorry, I did not get it. May
be give it another shot in a layman's terms and I might understand it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-19 19:00         ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-19 19:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> Hi Vivek,
> 
> > > +		base_rate = bdi->dirty_ratelimit;
> > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > +					       background_thresh, nr_dirty,
> > > +					       bdi_thresh, bdi_dirty);
> > > +		if (unlikely(pos_ratio == 0)) {
> > > +			pause = MAX_PAUSE;
> > > +			goto pause;
> > >  		}
> > > +		task_ratelimit = (u64)base_rate *
> > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > 
> > Hi Fenguaang,
> > 
> > I am little confused here. I see that you have already taken pos_ratio
> > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > that into account again in balance_diry_pages().
> > 
> > We calculated the pos_rate and balanced_rate and adjusted the
> > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> 
> Good question. There are some inter-dependencies in the calculation,
> and the dependency chain is the opposite to the one in your mind:
> balance_dirty_pages() used pos_ratio in the first place, so that
> bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> of the balanced dirty rate, too.
> 
> Let's return to how the balanced dirty rate is estimated. Please pay
> special attention to the last paragraphs below the "......" line.
> 
> Start by throttling each dd task at rate
> 
>         task_ratelimit = task_ratelimit_0                               (1)
>                          (any non-zero initial value is OK)
> 
> After 200ms, we measured
> 
>         dirty_rate = # of pages dirtied by all dd's / 200ms
>         write_bw   = # of pages written to the disk / 200ms
> 
> For the aggressive dd dirtiers, the equality holds
> 
>         dirty_rate == N * task_rate
>                    == N * task_ratelimit
>                    == N * task_ratelimit_0                              (2)
> Or     
>         task_ratelimit_0 = dirty_rate / N                               (3)
> 
> Now we conclude that the balanced task ratelimit can be estimated by
> 
>         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> 
> Because with (2) and (3), (4) yields the desired equality (1):
> 
>         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
>                       == write_bw / N

Hi Fengguang,

Following is my understanding. Please correct me where I got it wrong.

Ok, I think I follow till this point. I think what you are saying is
that following is our goal in a stable system.

	task_ratelimit = write_bw/N				(6)

So we measure the write_bw of a bdi over a period of time and use that
as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
task_ratelimit and hence we achieve the balance. So we will start with
some arbitrary task limit say task_ratelimit_0, and modify that limit
over a period of time based on our feedback loop to achieve a balanced
system. And following seems to be the formula.
					    write_bw
	task_ratelimit = task_ratelimit_0 * ------- 		(7)
					    dirty_rate

Now I also understand that by using (2) and (3), you proved that
how (7) will lead to (6) and that is our deisred goal. 

> 
> .............................................................................
> 
> Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> the ratelimit
> 
>         task_ratelimit = task_ratelimit_0
>                        = dirty_ratelimit * pos_ratio                    (5)
> 

So balance_drity_pages() chose to take into account pos_ratio() also
because for various reason like just taking into account only bandwidth
variation as feedback was not sufficient. So we also took pos_ratio
into account which in-trun is dependent on gloabal dirty pages and per
bdi dirty_pages/rate.

So we refined the formula for calculating a tasks's effective rate
over a period of time to following.
					    write_bw
	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
					    dirty_rate

Is my understanding right so far?

> Put (5) into (4), we get the final form used in
> bdi_update_dirty_ratelimit()
> 
>         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> 
> So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Now few questions.

- What is dirty_ratelimit in formula above?

- Is it wrong to understand the issue in following manner.

  bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
  and effectively tracks write_bw/N.

  bdi->dirty_ratelimit = write_bw/N

  or 

					    		  write_bw
  bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
					     		  dirty_rate

 Hence a tasks's balanced rate from (9) and (10) is.

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)

So my understanding about (10) and (11) is wrong? if no, then question
comes that bdi->dirty_ratelimit is supposed to be keeping track of 
write bandwidth variations only. And in turn task ratelimit will be
driven by both bandwidth varation as well as pos_ratio variation.

But you seem to be doing following.

 bdi->dirty_ratelimit = adjust based on a cobination of bandwidth feedback
		        and pos_ratio feedback. 

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)

So my question is that when task_ratelimit is finally being adjusted 
based on pos_ratio feedback, why bdi->dirty_ratelimit also needs to
take that into account.

I know you have tried explaining it, but sorry, I did not get it. May
be give it another shot in a layman's terms and I might understand it.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19 19:00         ` Vivek Goyal
@ 2011-08-21  3:46           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-21  3:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > Hi Vivek,
> > 
> > > > +		base_rate = bdi->dirty_ratelimit;
> > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > +					       background_thresh, nr_dirty,
> > > > +					       bdi_thresh, bdi_dirty);
> > > > +		if (unlikely(pos_ratio == 0)) {
> > > > +			pause = MAX_PAUSE;
> > > > +			goto pause;
> > > >  		}
> > > > +		task_ratelimit = (u64)base_rate *
> > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > 
> > > Hi Fenguaang,
> > > 
> > > I am little confused here. I see that you have already taken pos_ratio
> > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > that into account again in balance_diry_pages().
> > > 
> > > We calculated the pos_rate and balanced_rate and adjusted the
> > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > 
> > Good question. There are some inter-dependencies in the calculation,
> > and the dependency chain is the opposite to the one in your mind:
> > balance_dirty_pages() used pos_ratio in the first place, so that
> > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > of the balanced dirty rate, too.
> > 
> > Let's return to how the balanced dirty rate is estimated. Please pay
> > special attention to the last paragraphs below the "......" line.
> > 
> > Start by throttling each dd task at rate
> > 
> >         task_ratelimit = task_ratelimit_0                               (1)
> >                          (any non-zero initial value is OK)
> > 
> > After 200ms, we measured
> > 
> >         dirty_rate = # of pages dirtied by all dd's / 200ms
> >         write_bw   = # of pages written to the disk / 200ms
> > 
> > For the aggressive dd dirtiers, the equality holds
> > 
> >         dirty_rate == N * task_rate
> >                    == N * task_ratelimit
> >                    == N * task_ratelimit_0                              (2)
> > Or     
> >         task_ratelimit_0 = dirty_rate / N                               (3)
> > 
> > Now we conclude that the balanced task ratelimit can be estimated by
> > 
> >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > 
> > Because with (2) and (3), (4) yields the desired equality (1):
> > 
> >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> >                       == write_bw / N
> 
> Hi Fengguang,
> 
> Following is my understanding. Please correct me where I got it wrong.
> 
> Ok, I think I follow till this point. I think what you are saying is
> that following is our goal in a stable system.
> 
> 	task_ratelimit = write_bw/N				(6)
> 
> So we measure the write_bw of a bdi over a period of time and use that
> as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> task_ratelimit and hence we achieve the balance. So we will start with
> some arbitrary task limit say task_ratelimit_0, and modify that limit
> over a period of time based on our feedback loop to achieve a balanced
> system. And following seems to be the formula.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> 					    dirty_rate
> 
> Now I also understand that by using (2) and (3), you proved that
> how (7) will lead to (6) and that is our deisred goal. 

That's right.

> > 
> > .............................................................................
> > 
> > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > the ratelimit
> > 
> >         task_ratelimit = task_ratelimit_0
> >                        = dirty_ratelimit * pos_ratio                    (5)
> > 
> 
> So balance_drity_pages() chose to take into account pos_ratio() also
> because for various reason like just taking into account only bandwidth
> variation as feedback was not sufficient. So we also took pos_ratio
> into account which in-trun is dependent on gloabal dirty pages and per
> bdi dirty_pages/rate.

That's right so far. balance_drity_pages() needs to do dirty position
control, so used formula (5).

> So we refined the formula for calculating a tasks's effective rate
> over a period of time to following.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> 					    dirty_rate
> 

That's not true. It should still be formula (7) when
balance_drity_pages() considers pos_ratio.

> > Put (5) into (4), we get the final form used in
> > bdi_update_dirty_ratelimit()
> > 
> >         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> > 
> > So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.
> 
> Now few questions.
> 
> - What is dirty_ratelimit in formula above?

It's bdi->dirty_ratelimit.

> - Is it wrong to understand the issue in following manner.
> 
>   bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
>   and effectively tracks write_bw/N.
> 
>   bdi->dirty_ratelimit = write_bw/N

Yes. Strictly speaking, the target value is (note the "==")

        bdi->dirty_ratelimit == write_bw/N

>   or 
> 
> 					    		  write_bw
>   bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
> 					     		  dirty_rate

Both (9) and (10) are not true. The right form is

                                                                     write_bw
balanced_rate = whatever_ratelimit_executed_in_balance_dirty_pages * ----------
                                                                     dirty_rate

where

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio
bdi->dirty_ratelimit ~= balanced_rate

>  Hence a tasks's balanced rate from (9) and (10) is.
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)
> So my understanding about (10) and (11) is wrong? if no, then question
> comes that

(11) in itself is right. It's the exact form used in code.
 
> bdi->dirty_ratelimit is supposed to be keeping track of 
> write bandwidth variations only.

Yes in a stable workload. Besides, if the number of dd tasks (N)
changed, dirty_ratelimit will adapt to new value (write_bw / N).

> And in turn task ratelimit will be
> driven by both bandwidth varation as well as pos_ratio variation.

That's right.
 
> But you seem to be doing following.
> 
>  bdi->dirty_ratelimit = adjust based on a cobination of bandwidth feedback
> 		        and pos_ratio feedback. 
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)
> 
> So my question is that when task_ratelimit is finally being adjusted 
> based on pos_ratio feedback, why bdi->dirty_ratelimit also needs to
> take that into account.

In _concept_, bdi->dirty_ratelimit only depends on
whatever_ratelimit_executed_in_balance_dirty_pages.

Then, we try to estimate the latter with formula

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio

That is the main reason we want to limit the step size of bdi->dirty_ratelimit:
otherwise the above estimation will have big errors if bdi->dirty_ratelimit
has changed a lot during the past 200ms.

That's also the reason balanced_rate will have larger errors when
close to @limit: because there pos_ratio drops _quickly_ to 0, hence
the regular fluctuations in dirty pages will result in big
fluctuations in the _relative_ value of pos_ratio.

> I know you have tried explaining it, but sorry, I did not get it. May
> be give it another shot in a layman's terms and I might understand it.

Sorry for that. I can explain if you have more questions :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-21  3:46           ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-21  3:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > Hi Vivek,
> > 
> > > > +		base_rate = bdi->dirty_ratelimit;
> > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > +					       background_thresh, nr_dirty,
> > > > +					       bdi_thresh, bdi_dirty);
> > > > +		if (unlikely(pos_ratio == 0)) {
> > > > +			pause = MAX_PAUSE;
> > > > +			goto pause;
> > > >  		}
> > > > +		task_ratelimit = (u64)base_rate *
> > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > 
> > > Hi Fenguaang,
> > > 
> > > I am little confused here. I see that you have already taken pos_ratio
> > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > that into account again in balance_diry_pages().
> > > 
> > > We calculated the pos_rate and balanced_rate and adjusted the
> > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > 
> > Good question. There are some inter-dependencies in the calculation,
> > and the dependency chain is the opposite to the one in your mind:
> > balance_dirty_pages() used pos_ratio in the first place, so that
> > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > of the balanced dirty rate, too.
> > 
> > Let's return to how the balanced dirty rate is estimated. Please pay
> > special attention to the last paragraphs below the "......" line.
> > 
> > Start by throttling each dd task at rate
> > 
> >         task_ratelimit = task_ratelimit_0                               (1)
> >                          (any non-zero initial value is OK)
> > 
> > After 200ms, we measured
> > 
> >         dirty_rate = # of pages dirtied by all dd's / 200ms
> >         write_bw   = # of pages written to the disk / 200ms
> > 
> > For the aggressive dd dirtiers, the equality holds
> > 
> >         dirty_rate == N * task_rate
> >                    == N * task_ratelimit
> >                    == N * task_ratelimit_0                              (2)
> > Or     
> >         task_ratelimit_0 = dirty_rate / N                               (3)
> > 
> > Now we conclude that the balanced task ratelimit can be estimated by
> > 
> >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > 
> > Because with (2) and (3), (4) yields the desired equality (1):
> > 
> >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> >                       == write_bw / N
> 
> Hi Fengguang,
> 
> Following is my understanding. Please correct me where I got it wrong.
> 
> Ok, I think I follow till this point. I think what you are saying is
> that following is our goal in a stable system.
> 
> 	task_ratelimit = write_bw/N				(6)
> 
> So we measure the write_bw of a bdi over a period of time and use that
> as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> task_ratelimit and hence we achieve the balance. So we will start with
> some arbitrary task limit say task_ratelimit_0, and modify that limit
> over a period of time based on our feedback loop to achieve a balanced
> system. And following seems to be the formula.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> 					    dirty_rate
> 
> Now I also understand that by using (2) and (3), you proved that
> how (7) will lead to (6) and that is our deisred goal. 

That's right.

> > 
> > .............................................................................
> > 
> > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > the ratelimit
> > 
> >         task_ratelimit = task_ratelimit_0
> >                        = dirty_ratelimit * pos_ratio                    (5)
> > 
> 
> So balance_drity_pages() chose to take into account pos_ratio() also
> because for various reason like just taking into account only bandwidth
> variation as feedback was not sufficient. So we also took pos_ratio
> into account which in-trun is dependent on gloabal dirty pages and per
> bdi dirty_pages/rate.

That's right so far. balance_drity_pages() needs to do dirty position
control, so used formula (5).

> So we refined the formula for calculating a tasks's effective rate
> over a period of time to following.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> 					    dirty_rate
> 

That's not true. It should still be formula (7) when
balance_drity_pages() considers pos_ratio.

> > Put (5) into (4), we get the final form used in
> > bdi_update_dirty_ratelimit()
> > 
> >         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> > 
> > So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.
> 
> Now few questions.
> 
> - What is dirty_ratelimit in formula above?

It's bdi->dirty_ratelimit.

> - Is it wrong to understand the issue in following manner.
> 
>   bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
>   and effectively tracks write_bw/N.
> 
>   bdi->dirty_ratelimit = write_bw/N

Yes. Strictly speaking, the target value is (note the "==")

        bdi->dirty_ratelimit == write_bw/N

>   or 
> 
> 					    		  write_bw
>   bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
> 					     		  dirty_rate

Both (9) and (10) are not true. The right form is

                                                                     write_bw
balanced_rate = whatever_ratelimit_executed_in_balance_dirty_pages * ----------
                                                                     dirty_rate

where

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio
bdi->dirty_ratelimit ~= balanced_rate

>  Hence a tasks's balanced rate from (9) and (10) is.
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)
> So my understanding about (10) and (11) is wrong? if no, then question
> comes that

(11) in itself is right. It's the exact form used in code.
 
> bdi->dirty_ratelimit is supposed to be keeping track of 
> write bandwidth variations only.

Yes in a stable workload. Besides, if the number of dd tasks (N)
changed, dirty_ratelimit will adapt to new value (write_bw / N).

> And in turn task ratelimit will be
> driven by both bandwidth varation as well as pos_ratio variation.

That's right.
 
> But you seem to be doing following.
> 
>  bdi->dirty_ratelimit = adjust based on a cobination of bandwidth feedback
> 		        and pos_ratio feedback. 
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)
> 
> So my question is that when task_ratelimit is finally being adjusted 
> based on pos_ratio feedback, why bdi->dirty_ratelimit also needs to
> take that into account.

In _concept_, bdi->dirty_ratelimit only depends on
whatever_ratelimit_executed_in_balance_dirty_pages.

Then, we try to estimate the latter with formula

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio

That is the main reason we want to limit the step size of bdi->dirty_ratelimit:
otherwise the above estimation will have big errors if bdi->dirty_ratelimit
has changed a lot during the past 200ms.

That's also the reason balanced_rate will have larger errors when
close to @limit: because there pos_ratio drops _quickly_ to 0, hence
the regular fluctuations in dirty pages will result in big
fluctuations in the _relative_ value of pos_ratio.

> I know you have tried explaining it, but sorry, I did not get it. May
> be give it another shot in a layman's terms and I might understand it.

Sorry for that. I can explain if you have more questions :)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-21  3:46           ` Wu Fengguang
@ 2011-08-22 17:22             ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-22 17:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > Hi Vivek,
> > > 
> > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > +					       background_thresh, nr_dirty,
> > > > > +					       bdi_thresh, bdi_dirty);
> > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > +			pause = MAX_PAUSE;
> > > > > +			goto pause;
> > > > >  		}
> > > > > +		task_ratelimit = (u64)base_rate *
> > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > 
> > > > Hi Fenguaang,
> > > > 
> > > > I am little confused here. I see that you have already taken pos_ratio
> > > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > > that into account again in balance_diry_pages().
> > > > 
> > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > 
> > > Good question. There are some inter-dependencies in the calculation,
> > > and the dependency chain is the opposite to the one in your mind:
> > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > > of the balanced dirty rate, too.
> > > 
> > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > special attention to the last paragraphs below the "......" line.
> > > 
> > > Start by throttling each dd task at rate
> > > 
> > >         task_ratelimit = task_ratelimit_0                               (1)
> > >                          (any non-zero initial value is OK)
> > > 
> > > After 200ms, we measured
> > > 
> > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > >         write_bw   = # of pages written to the disk / 200ms
> > > 
> > > For the aggressive dd dirtiers, the equality holds
> > > 
> > >         dirty_rate == N * task_rate
> > >                    == N * task_ratelimit
> > >                    == N * task_ratelimit_0                              (2)
> > > Or     
> > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > 
> > > Now we conclude that the balanced task ratelimit can be estimated by
> > > 
> > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > 
> > > Because with (2) and (3), (4) yields the desired equality (1):
> > > 
> > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > >                       == write_bw / N
> > 
> > Hi Fengguang,
> > 
> > Following is my understanding. Please correct me where I got it wrong.
> > 
> > Ok, I think I follow till this point. I think what you are saying is
> > that following is our goal in a stable system.
> > 
> > 	task_ratelimit = write_bw/N				(6)
> > 
> > So we measure the write_bw of a bdi over a period of time and use that
> > as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> > task_ratelimit and hence we achieve the balance. So we will start with
> > some arbitrary task limit say task_ratelimit_0, and modify that limit
> > over a period of time based on our feedback loop to achieve a balanced
> > system. And following seems to be the formula.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > 					    dirty_rate
> > 
> > Now I also understand that by using (2) and (3), you proved that
> > how (7) will lead to (6) and that is our deisred goal. 
> 
> That's right.
> 
> > > 
> > > .............................................................................
> > > 
> > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > the ratelimit
> > > 
> > >         task_ratelimit = task_ratelimit_0
> > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > 
> > 
> > So balance_drity_pages() chose to take into account pos_ratio() also
> > because for various reason like just taking into account only bandwidth
> > variation as feedback was not sufficient. So we also took pos_ratio
> > into account which in-trun is dependent on gloabal dirty pages and per
> > bdi dirty_pages/rate.
> 
> That's right so far. balance_drity_pages() needs to do dirty position
> control, so used formula (5).
> 
> > So we refined the formula for calculating a tasks's effective rate
> > over a period of time to following.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > 					    dirty_rate
> > 
> 
> That's not true. It should still be formula (7) when
> balance_drity_pages() considers pos_ratio.

Why it is not true? If I do some math, it sounds right. Let me summarize
my understanding again.

- In a steady state stable system, we want dirty_bw = write_bw, IOW.
 
  dirty_bw/write_bw = 1  		(1)

  If we can achieve above then that means we are throttling tasks at
  just right rate.

Or
-  dirty_bw  == write_bw
   N * task_ratelimit == write_bw
   task_ratelimit =  write_bw/N         (2)

  So as long as we can come up with a system where balance_dirty_pages()
  calculates task_ratelimit to be write_bw/N, we should be fine.

- But this does not take care of imbalances. So if system goes out of
  balance before feedback loop kicks in and dirty rate shoots up, then
  cache size will grow and number of dirty pages will shoot up. Hence
  we brought in the notion of position ratio where we also vary a 
  tasks's dirty ratelimit based on number of dirty pages. So our
  effective formula became.

  task_ratelimit = write_bw/N * pos_ratio     (3)

  So as long as we meet (3), we should reach to stable state.

-  But here N is unknown in advance so balance_drity_pages() can not make
   use of this formula directly. But write_bw and dirty_bw from previous
   200ms are known. So following can replace (3).

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
					dirty_bw	

   dirty_bw = tas_ratelimit_0 * N                (5)

   Substitute (5) in (4)

   task_ratelimit = write_bw/N * pos_ratio      (6)

   (6) is same as (3) which has been derived from (4) and that means at any
   given point of time (4) can be used by balance_drity_pages() to calculate
   a tasks's throttling rate.

- Now going back to (4). Because we have a feedback loop where we
  continuously update a previous number based on feedback, we can track
  previous value in bdi->dirty_ratelimit.

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
					dirty_bw	

   Or

   task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)

   where
					    write_bw	
  bdi->dirty_ratelimit = task_ratelimit_0 * ---------
					    dirty_bw
  
  Because task_ratelimit_0 is initial value to begin with and we will
  keep on coming with new value every 200ms, we should be able to write
  above as follows.

						      write_bw
  bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
						      dirty_bw

  Effectively we start with an initial value of task_ratelimit_0 and
  then keep on updating it based on rate change feedback every 200ms.

  To summarize,

  We need to achieve (3) for a balanced system. Because we don't know the
  value of N in advance, we can use (4) to achieve effect of (3). So we
  start with a default value of task_ratelimit_0 and update that every
  200ms based on how write and dirty rate on device is changing (8). We also
  further refine that rate by pos_ratio so that any variations in number
  of dirty pages due to temporary imbalances in the system can be
  accounted for (7).

I see that you also use (7). I think only contention point is how
(8) is perceived. So can you please explain why do you think that
above calculation or (9) is wrong.

I can kind of understand that you have done various adjustments to keep the
task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-22 17:22             ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-22 17:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > Hi Vivek,
> > > 
> > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > +					       background_thresh, nr_dirty,
> > > > > +					       bdi_thresh, bdi_dirty);
> > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > +			pause = MAX_PAUSE;
> > > > > +			goto pause;
> > > > >  		}
> > > > > +		task_ratelimit = (u64)base_rate *
> > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > 
> > > > Hi Fenguaang,
> > > > 
> > > > I am little confused here. I see that you have already taken pos_ratio
> > > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > > that into account again in balance_diry_pages().
> > > > 
> > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > 
> > > Good question. There are some inter-dependencies in the calculation,
> > > and the dependency chain is the opposite to the one in your mind:
> > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > > of the balanced dirty rate, too.
> > > 
> > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > special attention to the last paragraphs below the "......" line.
> > > 
> > > Start by throttling each dd task at rate
> > > 
> > >         task_ratelimit = task_ratelimit_0                               (1)
> > >                          (any non-zero initial value is OK)
> > > 
> > > After 200ms, we measured
> > > 
> > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > >         write_bw   = # of pages written to the disk / 200ms
> > > 
> > > For the aggressive dd dirtiers, the equality holds
> > > 
> > >         dirty_rate == N * task_rate
> > >                    == N * task_ratelimit
> > >                    == N * task_ratelimit_0                              (2)
> > > Or     
> > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > 
> > > Now we conclude that the balanced task ratelimit can be estimated by
> > > 
> > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > 
> > > Because with (2) and (3), (4) yields the desired equality (1):
> > > 
> > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > >                       == write_bw / N
> > 
> > Hi Fengguang,
> > 
> > Following is my understanding. Please correct me where I got it wrong.
> > 
> > Ok, I think I follow till this point. I think what you are saying is
> > that following is our goal in a stable system.
> > 
> > 	task_ratelimit = write_bw/N				(6)
> > 
> > So we measure the write_bw of a bdi over a period of time and use that
> > as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> > task_ratelimit and hence we achieve the balance. So we will start with
> > some arbitrary task limit say task_ratelimit_0, and modify that limit
> > over a period of time based on our feedback loop to achieve a balanced
> > system. And following seems to be the formula.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > 					    dirty_rate
> > 
> > Now I also understand that by using (2) and (3), you proved that
> > how (7) will lead to (6) and that is our deisred goal. 
> 
> That's right.
> 
> > > 
> > > .............................................................................
> > > 
> > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > the ratelimit
> > > 
> > >         task_ratelimit = task_ratelimit_0
> > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > 
> > 
> > So balance_drity_pages() chose to take into account pos_ratio() also
> > because for various reason like just taking into account only bandwidth
> > variation as feedback was not sufficient. So we also took pos_ratio
> > into account which in-trun is dependent on gloabal dirty pages and per
> > bdi dirty_pages/rate.
> 
> That's right so far. balance_drity_pages() needs to do dirty position
> control, so used formula (5).
> 
> > So we refined the formula for calculating a tasks's effective rate
> > over a period of time to following.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > 					    dirty_rate
> > 
> 
> That's not true. It should still be formula (7) when
> balance_drity_pages() considers pos_ratio.

Why it is not true? If I do some math, it sounds right. Let me summarize
my understanding again.

- In a steady state stable system, we want dirty_bw = write_bw, IOW.
 
  dirty_bw/write_bw = 1  		(1)

  If we can achieve above then that means we are throttling tasks at
  just right rate.

Or
-  dirty_bw  == write_bw
   N * task_ratelimit == write_bw
   task_ratelimit =  write_bw/N         (2)

  So as long as we can come up with a system where balance_dirty_pages()
  calculates task_ratelimit to be write_bw/N, we should be fine.

- But this does not take care of imbalances. So if system goes out of
  balance before feedback loop kicks in and dirty rate shoots up, then
  cache size will grow and number of dirty pages will shoot up. Hence
  we brought in the notion of position ratio where we also vary a 
  tasks's dirty ratelimit based on number of dirty pages. So our
  effective formula became.

  task_ratelimit = write_bw/N * pos_ratio     (3)

  So as long as we meet (3), we should reach to stable state.

-  But here N is unknown in advance so balance_drity_pages() can not make
   use of this formula directly. But write_bw and dirty_bw from previous
   200ms are known. So following can replace (3).

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
					dirty_bw	

   dirty_bw = tas_ratelimit_0 * N                (5)

   Substitute (5) in (4)

   task_ratelimit = write_bw/N * pos_ratio      (6)

   (6) is same as (3) which has been derived from (4) and that means at any
   given point of time (4) can be used by balance_drity_pages() to calculate
   a tasks's throttling rate.

- Now going back to (4). Because we have a feedback loop where we
  continuously update a previous number based on feedback, we can track
  previous value in bdi->dirty_ratelimit.

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
					dirty_bw	

   Or

   task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)

   where
					    write_bw	
  bdi->dirty_ratelimit = task_ratelimit_0 * ---------
					    dirty_bw
  
  Because task_ratelimit_0 is initial value to begin with and we will
  keep on coming with new value every 200ms, we should be able to write
  above as follows.

						      write_bw
  bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
						      dirty_bw

  Effectively we start with an initial value of task_ratelimit_0 and
  then keep on updating it based on rate change feedback every 200ms.

  To summarize,

  We need to achieve (3) for a balanced system. Because we don't know the
  value of N in advance, we can use (4) to achieve effect of (3). So we
  start with a default value of task_ratelimit_0 and update that every
  200ms based on how write and dirty rate on device is changing (8). We also
  further refine that rate by pos_ratio so that any variations in number
  of dirty pages due to temporary imbalances in the system can be
  accounted for (7).

I see that you also use (7). I think only contention point is how
(8) is perceived. So can you please explain why do you think that
above calculation or (9) is wrong.

I can kind of understand that you have done various adjustments to keep the
task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-22 17:22             ` Vivek Goyal
@ 2011-08-23  1:07               ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-23  1:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 01:22:30AM +0800, Vivek Goyal wrote:
> On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> > On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > > Hi Vivek,
> > > > 
> > > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > > +					       background_thresh, nr_dirty,
> > > > > > +					       bdi_thresh, bdi_dirty);
> > > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > > +			pause = MAX_PAUSE;
> > > > > > +			goto pause;
> > > > > >  		}
> > > > > > +		task_ratelimit = (u64)base_rate *
> > > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > > 
> > > > > Hi Fenguaang,
> > > > > 
> > > > > I am little confused here. I see that you have already taken pos_ratio
> > > > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > > > that into account again in balance_diry_pages().
> > > > > 
> > > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > > 
> > > > Good question. There are some inter-dependencies in the calculation,
> > > > and the dependency chain is the opposite to the one in your mind:
> > > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > > > of the balanced dirty rate, too.
> > > > 
> > > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > > special attention to the last paragraphs below the "......" line.
> > > > 
> > > > Start by throttling each dd task at rate
> > > > 
> > > >         task_ratelimit = task_ratelimit_0                               (1)
> > > >                          (any non-zero initial value is OK)
> > > > 
> > > > After 200ms, we measured
> > > > 
> > > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > > >         write_bw   = # of pages written to the disk / 200ms
> > > > 
> > > > For the aggressive dd dirtiers, the equality holds
> > > > 
> > > >         dirty_rate == N * task_rate
> > > >                    == N * task_ratelimit
> > > >                    == N * task_ratelimit_0                              (2)
> > > > Or     
> > > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > > 
> > > > Now we conclude that the balanced task ratelimit can be estimated by
> > > > 
> > > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > > 
> > > > Because with (2) and (3), (4) yields the desired equality (1):
> > > > 
> > > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > > >                       == write_bw / N
> > > 
> > > Hi Fengguang,
> > > 
> > > Following is my understanding. Please correct me where I got it wrong.
> > > 
> > > Ok, I think I follow till this point. I think what you are saying is
> > > that following is our goal in a stable system.
> > > 
> > > 	task_ratelimit = write_bw/N				(6)
> > > 
> > > So we measure the write_bw of a bdi over a period of time and use that
> > > as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> > > task_ratelimit and hence we achieve the balance. So we will start with
> > > some arbitrary task limit say task_ratelimit_0, and modify that limit
> > > over a period of time based on our feedback loop to achieve a balanced
> > > system. And following seems to be the formula.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > > 					    dirty_rate
> > > 
> > > Now I also understand that by using (2) and (3), you proved that
> > > how (7) will lead to (6) and that is our deisred goal. 
> > 
> > That's right.
> > 
> > > > 
> > > > .............................................................................
> > > > 
> > > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > > the ratelimit
> > > > 
> > > >         task_ratelimit = task_ratelimit_0
> > > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > > 
> > > 
> > > So balance_drity_pages() chose to take into account pos_ratio() also
> > > because for various reason like just taking into account only bandwidth
> > > variation as feedback was not sufficient. So we also took pos_ratio
> > > into account which in-trun is dependent on gloabal dirty pages and per
> > > bdi dirty_pages/rate.
> > 
> > That's right so far. balance_drity_pages() needs to do dirty position
> > control, so used formula (5).
> > 
> > > So we refined the formula for calculating a tasks's effective rate
> > > over a period of time to following.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > 					    dirty_rate
> > > 
> > 
> > That's not true. It should still be formula (7) when
> > balance_drity_pages() considers pos_ratio.
> 
> Why it is not true? If I do some math, it sounds right. Let me summarize
> my understanding again.

Ah sorry! (9) actually holds true, as made clear by your below reasoning.

> - In a steady state stable system, we want dirty_bw = write_bw, IOW.
>  
>   dirty_bw/write_bw = 1  		(1)
> 
>   If we can achieve above then that means we are throttling tasks at
>   just right rate.
> 
> Or
> -  dirty_bw  == write_bw
>    N * task_ratelimit == write_bw
>    task_ratelimit =  write_bw/N         (2)
> 
>   So as long as we can come up with a system where balance_dirty_pages()
>   calculates task_ratelimit to be write_bw/N, we should be fine.

Right.

> - But this does not take care of imbalances. So if system goes out of
>   balance before feedback loop kicks in and dirty rate shoots up, then
>   cache size will grow and number of dirty pages will shoot up. Hence
>   we brought in the notion of position ratio where we also vary a 
>   tasks's dirty ratelimit based on number of dirty pages. So our
>   effective formula became.
> 
>   task_ratelimit = write_bw/N * pos_ratio     (3)
> 
>   So as long as we meet (3), we should reach to stable state.

Right.

> -  But here N is unknown in advance so balance_drity_pages() can not make
>    use of this formula directly. But write_bw and dirty_bw from previous
>    200ms are known. So following can replace (3).
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> 					dirty_bw	
> 
>    dirty_bw = task_ratelimit_0 * N                (5)
> 
>    Substitute (5) in (4)
> 
>    task_ratelimit = write_bw/N * pos_ratio      (6)
> 
>    (6) is same as (3) which has been derived from (4) and that means at any
>    given point of time (4) can be used by balance_drity_pages() to calculate
>    a tasks's throttling rate.

Right. Sorry what's in my mind was

                                       write_bw
    balanced_rate = task_ratelimit_0 * --------
                                       dirty_bw        

    task_ratelimit = balanced_rate * pos_ratio

which is effective the same to your combined equation (4).

> - Now going back to (4). Because we have a feedback loop where we
>   continuously update a previous number based on feedback, we can track
>   previous value in bdi->dirty_ratelimit.
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> 					dirty_bw	
> 
>    Or
> 
>    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> 
>    where
> 					    write_bw	
>   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> 					    dirty_bw

Right.

>   Because task_ratelimit_0 is initial value to begin with and we will
>   keep on coming with new value every 200ms, we should be able to write
>   above as follows.
> 
> 						      write_bw
>   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> 						      dirty_bw
> 
>   Effectively we start with an initial value of task_ratelimit_0 and
>   then keep on updating it based on rate change feedback every 200ms.

Right.

>   To summarize,
> 
>   We need to achieve (3) for a balanced system. Because we don't know the
>   value of N in advance, we can use (4) to achieve effect of (3). So we
>   start with a default value of task_ratelimit_0 and update that every
>   200ms based on how write and dirty rate on device is changing (8). We also
>   further refine that rate by pos_ratio so that any variations in number
>   of dirty pages due to temporary imbalances in the system can be
>   accounted for (7).
> 
> I see that you also use (7). I think only contention point is how
> (8) is perceived. So can you please explain why do you think that
> above calculation or (9) is wrong.

There is no contention point and (9) is right..Sorry it's my fault.
We are well aligned in the above reasoning :)

> I can kind of understand that you have done various adjustments to keep the
> task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> I am not able to understand your calculations in updating bdi->dirty_ratelimit.  

You mean the below chunk of code? Which is effectively the same as this _one_
line of code

        bdi->dirty_ratelimit = balanced_rate;

except for doing some tricks (conditional update and limiting step size) to
stabilize bdi->dirty_ratelimit:

        unsigned long base_rate = bdi->dirty_ratelimit;

        /*
         * Use a different name for the same value to distinguish the concepts.
         * Only the relative value of
         *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
         * will be used below, which reflects the direction and size of dirty
         * position error.
         */
        pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
         * same side of dirty_ratelimit, too.
         * For example,
         * - (base_rate > balanced_rate) => dirty rate is too high
         * - (base_rate > pos_rate)      => dirty pages are above setpoint
         * so lowering base_rate will help meet both the position and rate
         * control targets. Otherwise, don't update base_rate if it will only
         * help meet the rate target. After all, what the users ultimately feel
         * and care are stable dirty rate and small position error.  This
         * update policy can also prevent dirty_ratelimit from being driven
         * away by possible systematic errors in balanced_rate.
         *
         * |base_rate - pos_rate| is also used to limit the step size for
         * filtering out the sigular points of balanced_rate, which keeps
         * jumping around randomly and can even leap far away at times due to
         * the small 200ms estimation period of dirty_rate (we want to keep
         * that period small to reduce time lags).
         */
        delta = 0;
        if (base_rate < balanced_rate) {
                if (base_rate < pos_rate)
                        delta = min(balanced_rate, pos_rate) - base_rate;
        } else {
                if (base_rate > pos_rate)
                        delta = base_rate - max(balanced_rate, pos_rate);
        }
       
        /*
         * Don't pursue 100% rate matching. It's impossible since the balanced
         * rate itself is constantly fluctuating. So decrease the track speed
         * when it gets close to the target. Helps eliminate pointless tremors.
         */
        delta >>= base_rate / (8 * delta + 1);
        /*
         * Limit the tracking speed to avoid overshooting.
         */
        delta = (delta + 7) / 8;

        if (base_rate < balanced_rate)
                base_rate += delta;
        else   
                base_rate -= delta;

        bdi->dirty_ratelimit = max(base_rate, 1UL);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-23  1:07               ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-23  1:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 01:22:30AM +0800, Vivek Goyal wrote:
> On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> > On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > > Hi Vivek,
> > > > 
> > > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > > +					       background_thresh, nr_dirty,
> > > > > > +					       bdi_thresh, bdi_dirty);
> > > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > > +			pause = MAX_PAUSE;
> > > > > > +			goto pause;
> > > > > >  		}
> > > > > > +		task_ratelimit = (u64)base_rate *
> > > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > > 
> > > > > Hi Fenguaang,
> > > > > 
> > > > > I am little confused here. I see that you have already taken pos_ratio
> > > > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > > > that into account again in balance_diry_pages().
> > > > > 
> > > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > > 
> > > > Good question. There are some inter-dependencies in the calculation,
> > > > and the dependency chain is the opposite to the one in your mind:
> > > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > > > of the balanced dirty rate, too.
> > > > 
> > > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > > special attention to the last paragraphs below the "......" line.
> > > > 
> > > > Start by throttling each dd task at rate
> > > > 
> > > >         task_ratelimit = task_ratelimit_0                               (1)
> > > >                          (any non-zero initial value is OK)
> > > > 
> > > > After 200ms, we measured
> > > > 
> > > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > > >         write_bw   = # of pages written to the disk / 200ms
> > > > 
> > > > For the aggressive dd dirtiers, the equality holds
> > > > 
> > > >         dirty_rate == N * task_rate
> > > >                    == N * task_ratelimit
> > > >                    == N * task_ratelimit_0                              (2)
> > > > Or     
> > > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > > 
> > > > Now we conclude that the balanced task ratelimit can be estimated by
> > > > 
> > > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > > 
> > > > Because with (2) and (3), (4) yields the desired equality (1):
> > > > 
> > > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > > >                       == write_bw / N
> > > 
> > > Hi Fengguang,
> > > 
> > > Following is my understanding. Please correct me where I got it wrong.
> > > 
> > > Ok, I think I follow till this point. I think what you are saying is
> > > that following is our goal in a stable system.
> > > 
> > > 	task_ratelimit = write_bw/N				(6)
> > > 
> > > So we measure the write_bw of a bdi over a period of time and use that
> > > as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> > > task_ratelimit and hence we achieve the balance. So we will start with
> > > some arbitrary task limit say task_ratelimit_0, and modify that limit
> > > over a period of time based on our feedback loop to achieve a balanced
> > > system. And following seems to be the formula.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > > 					    dirty_rate
> > > 
> > > Now I also understand that by using (2) and (3), you proved that
> > > how (7) will lead to (6) and that is our deisred goal. 
> > 
> > That's right.
> > 
> > > > 
> > > > .............................................................................
> > > > 
> > > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > > the ratelimit
> > > > 
> > > >         task_ratelimit = task_ratelimit_0
> > > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > > 
> > > 
> > > So balance_drity_pages() chose to take into account pos_ratio() also
> > > because for various reason like just taking into account only bandwidth
> > > variation as feedback was not sufficient. So we also took pos_ratio
> > > into account which in-trun is dependent on gloabal dirty pages and per
> > > bdi dirty_pages/rate.
> > 
> > That's right so far. balance_drity_pages() needs to do dirty position
> > control, so used formula (5).
> > 
> > > So we refined the formula for calculating a tasks's effective rate
> > > over a period of time to following.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > 					    dirty_rate
> > > 
> > 
> > That's not true. It should still be formula (7) when
> > balance_drity_pages() considers pos_ratio.
> 
> Why it is not true? If I do some math, it sounds right. Let me summarize
> my understanding again.

Ah sorry! (9) actually holds true, as made clear by your below reasoning.

> - In a steady state stable system, we want dirty_bw = write_bw, IOW.
>  
>   dirty_bw/write_bw = 1  		(1)
> 
>   If we can achieve above then that means we are throttling tasks at
>   just right rate.
> 
> Or
> -  dirty_bw  == write_bw
>    N * task_ratelimit == write_bw
>    task_ratelimit =  write_bw/N         (2)
> 
>   So as long as we can come up with a system where balance_dirty_pages()
>   calculates task_ratelimit to be write_bw/N, we should be fine.

Right.

> - But this does not take care of imbalances. So if system goes out of
>   balance before feedback loop kicks in and dirty rate shoots up, then
>   cache size will grow and number of dirty pages will shoot up. Hence
>   we brought in the notion of position ratio where we also vary a 
>   tasks's dirty ratelimit based on number of dirty pages. So our
>   effective formula became.
> 
>   task_ratelimit = write_bw/N * pos_ratio     (3)
> 
>   So as long as we meet (3), we should reach to stable state.

Right.

> -  But here N is unknown in advance so balance_drity_pages() can not make
>    use of this formula directly. But write_bw and dirty_bw from previous
>    200ms are known. So following can replace (3).
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> 					dirty_bw	
> 
>    dirty_bw = task_ratelimit_0 * N                (5)
> 
>    Substitute (5) in (4)
> 
>    task_ratelimit = write_bw/N * pos_ratio      (6)
> 
>    (6) is same as (3) which has been derived from (4) and that means at any
>    given point of time (4) can be used by balance_drity_pages() to calculate
>    a tasks's throttling rate.

Right. Sorry what's in my mind was

                                       write_bw
    balanced_rate = task_ratelimit_0 * --------
                                       dirty_bw        

    task_ratelimit = balanced_rate * pos_ratio

which is effective the same to your combined equation (4).

> - Now going back to (4). Because we have a feedback loop where we
>   continuously update a previous number based on feedback, we can track
>   previous value in bdi->dirty_ratelimit.
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> 					dirty_bw	
> 
>    Or
> 
>    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> 
>    where
> 					    write_bw	
>   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> 					    dirty_bw

Right.

>   Because task_ratelimit_0 is initial value to begin with and we will
>   keep on coming with new value every 200ms, we should be able to write
>   above as follows.
> 
> 						      write_bw
>   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> 						      dirty_bw
> 
>   Effectively we start with an initial value of task_ratelimit_0 and
>   then keep on updating it based on rate change feedback every 200ms.

Right.

>   To summarize,
> 
>   We need to achieve (3) for a balanced system. Because we don't know the
>   value of N in advance, we can use (4) to achieve effect of (3). So we
>   start with a default value of task_ratelimit_0 and update that every
>   200ms based on how write and dirty rate on device is changing (8). We also
>   further refine that rate by pos_ratio so that any variations in number
>   of dirty pages due to temporary imbalances in the system can be
>   accounted for (7).
> 
> I see that you also use (7). I think only contention point is how
> (8) is perceived. So can you please explain why do you think that
> above calculation or (9) is wrong.

There is no contention point and (9) is right..Sorry it's my fault.
We are well aligned in the above reasoning :)

> I can kind of understand that you have done various adjustments to keep the
> task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> I am not able to understand your calculations in updating bdi->dirty_ratelimit.  

You mean the below chunk of code? Which is effectively the same as this _one_
line of code

        bdi->dirty_ratelimit = balanced_rate;

except for doing some tricks (conditional update and limiting step size) to
stabilize bdi->dirty_ratelimit:

        unsigned long base_rate = bdi->dirty_ratelimit;

        /*
         * Use a different name for the same value to distinguish the concepts.
         * Only the relative value of
         *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
         * will be used below, which reflects the direction and size of dirty
         * position error.
         */
        pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
         * same side of dirty_ratelimit, too.
         * For example,
         * - (base_rate > balanced_rate) => dirty rate is too high
         * - (base_rate > pos_rate)      => dirty pages are above setpoint
         * so lowering base_rate will help meet both the position and rate
         * control targets. Otherwise, don't update base_rate if it will only
         * help meet the rate target. After all, what the users ultimately feel
         * and care are stable dirty rate and small position error.  This
         * update policy can also prevent dirty_ratelimit from being driven
         * away by possible systematic errors in balanced_rate.
         *
         * |base_rate - pos_rate| is also used to limit the step size for
         * filtering out the sigular points of balanced_rate, which keeps
         * jumping around randomly and can even leap far away at times due to
         * the small 200ms estimation period of dirty_rate (we want to keep
         * that period small to reduce time lags).
         */
        delta = 0;
        if (base_rate < balanced_rate) {
                if (base_rate < pos_rate)
                        delta = min(balanced_rate, pos_rate) - base_rate;
        } else {
                if (base_rate > pos_rate)
                        delta = base_rate - max(balanced_rate, pos_rate);
        }
       
        /*
         * Don't pursue 100% rate matching. It's impossible since the balanced
         * rate itself is constantly fluctuating. So decrease the track speed
         * when it gets close to the target. Helps eliminate pointless tremors.
         */
        delta >>= base_rate / (8 * delta + 1);
        /*
         * Limit the tracking speed to avoid overshooting.
         */
        delta = (delta + 7) / 8;

        if (base_rate < balanced_rate)
                base_rate += delta;
        else   
                base_rate -= delta;

        bdi->dirty_ratelimit = max(base_rate, 1UL);

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23  1:07               ` Wu Fengguang
@ 2011-08-23  3:53                 ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.

Ah sorry, based on the reply to Peter, there is no inherent dependency
between balanced_rate_n and balanced_rate_(n-1). bdi->dirty_ratelimit does
track balanced_rate in small steps, and hence will have some relationship
with its previous value other than equation (8).

So, although you may conduct equation (8) for balanced_rate, we'd
better not understand things in that way. Keep this fundamental
formula in mind and don't try to complicate it:

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-23  3:53                 ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.

Ah sorry, based on the reply to Peter, there is no inherent dependency
between balanced_rate_n and balanced_rate_(n-1). bdi->dirty_ratelimit does
track balanced_rate in small steps, and hence will have some relationship
with its previous value other than equation (8).

So, although you may conduct equation (8) for balanced_rate, we'd
better not understand things in that way. Keep this fundamental
formula in mind and don't try to complicate it:

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23  1:07               ` Wu Fengguang
@ 2011-08-23 13:53                 ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-23 13:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:

[..]
> > > > So we refined the formula for calculating a tasks's effective rate
> > > > over a period of time to following.
> > > > 					    write_bw
> > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > 					    dirty_rate
> > > > 
> > > 
> > > That's not true. It should still be formula (7) when
> > > balance_drity_pages() considers pos_ratio.
> > 
> > Why it is not true? If I do some math, it sounds right. Let me summarize
> > my understanding again.
> 
> Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> 
> > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> >  
> >   dirty_bw/write_bw = 1  		(1)
> > 
> >   If we can achieve above then that means we are throttling tasks at
> >   just right rate.
> > 
> > Or
> > -  dirty_bw  == write_bw
> >    N * task_ratelimit == write_bw
> >    task_ratelimit =  write_bw/N         (2)
> > 
> >   So as long as we can come up with a system where balance_dirty_pages()
> >   calculates task_ratelimit to be write_bw/N, we should be fine.
> 
> Right.
> 
> > - But this does not take care of imbalances. So if system goes out of
> >   balance before feedback loop kicks in and dirty rate shoots up, then
> >   cache size will grow and number of dirty pages will shoot up. Hence
> >   we brought in the notion of position ratio where we also vary a 
> >   tasks's dirty ratelimit based on number of dirty pages. So our
> >   effective formula became.
> > 
> >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > 
> >   So as long as we meet (3), we should reach to stable state.
> 
> Right.
> 
> > -  But here N is unknown in advance so balance_drity_pages() can not make
> >    use of this formula directly. But write_bw and dirty_bw from previous
> >    200ms are known. So following can replace (3).
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > 					dirty_bw	
> > 
> >    dirty_bw = task_ratelimit_0 * N                (5)
> > 
> >    Substitute (5) in (4)
> > 
> >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > 
> >    (6) is same as (3) which has been derived from (4) and that means at any
> >    given point of time (4) can be used by balance_drity_pages() to calculate
> >    a tasks's throttling rate.
> 
> Right. Sorry what's in my mind was
> 
>                                        write_bw
>     balanced_rate = task_ratelimit_0 * --------
>                                        dirty_bw        
> 
>     task_ratelimit = balanced_rate * pos_ratio
> 
> which is effective the same to your combined equation (4).
> 
> > - Now going back to (4). Because we have a feedback loop where we
> >   continuously update a previous number based on feedback, we can track
> >   previous value in bdi->dirty_ratelimit.
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > 					dirty_bw	
> > 
> >    Or
> > 
> >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > 
> >    where
> > 					    write_bw	
> >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > 					    dirty_bw
> 
> Right.
> 
> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.
> 
> Right.
> 
> >   To summarize,
> > 
> >   We need to achieve (3) for a balanced system. Because we don't know the
> >   value of N in advance, we can use (4) to achieve effect of (3). So we
> >   start with a default value of task_ratelimit_0 and update that every
> >   200ms based on how write and dirty rate on device is changing (8). We also
> >   further refine that rate by pos_ratio so that any variations in number
> >   of dirty pages due to temporary imbalances in the system can be
> >   accounted for (7).
> > 
> > I see that you also use (7). I think only contention point is how
> > (8) is perceived. So can you please explain why do you think that
> > above calculation or (9) is wrong.
> 
> There is no contention point and (9) is right..Sorry it's my fault.
> We are well aligned in the above reasoning :)

Great. Now we are on same page now at least till this point.

> 
> > I can kind of understand that you have done various adjustments to keep the
> > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> 
> You mean the below chunk of code? Which is effectively the same as this _one_
> line of code
> 
>         bdi->dirty_ratelimit = balanced_rate;
> 
> except for doing some tricks (conditional update and limiting step size) to
> stabilize bdi->dirty_ratelimit:

I am fine with bdi->dirty_ratelimit being called balanced rate. I am
taking exception to the fact that you are also taking into accout
pos_ratio while coming up with new balanced_rate after 200ms of feedback.

We agreed to updating bdi->dirty_ratelimit as follows (8 above).

 
 						      write_bw
   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
 						      dirty_bw

I think in your terminology it could be called.
					   write_bw
  new_balanced_rate = prev_balanced_rate * ----------            (9)
					   dirty_bw

But what you seem to be doing is following.
							write_bw
  new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
							dirty_bw

Of course I have just tried to simlify your actual calculations to
show why I am questioning the presence of pos_ratio while calculating
the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.

So (9) and (10) don't match?

Now going back to your code and show how I arrived at (10).

executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
			dirty_rate | 1);			(12)

Combining (11) and (12) gives us (10).
				     write_bw
balance_rate = base_rate * pos_ratio --------
				     dirty_rate

Or
					    write_bw
bdi->dirty_ratelimit = base_rate * pos_ratio --------
					     dirty_rate

To complicate the things you also have the notion of pos_rate and reduce
the step size based on either pos_rate or balance_rate.

pos_rate = executed_rate = base_rate * pos_ratio;

				     write_bw
balance_rate = base_rate * pos_ratio --------
				     dirty_rate

bdi->dirty_rate_limit = min_change(pos_rate, balance_rate)       (13)

So for feedback, why are not sticking to simply (9) and limit the step
size and not take pos_ratio into account. 

Even if you have to take it into account, it needs to be explained clearly
and so many rate definitions confuse things more. Keeping name constant
everywhere (even for local variables), helps understand the code better.

Look at number of rates we have in code and it gets so confusing.

balanced_rate
base_rate
bdi->dirty_ratelimit

executed_rate
pos_rate
task_ratelimit

dirty_rate
write_bw

Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
referring to same thing and that is not obivious from the code. Looks
like task->ratelimit and executed_rate and pos_rate are referring to same
thing.

So instead of 6 rates, we could atleast collpase the naming to 2 rates
to keep the context clear. Just prefix/suffix more strings to highlight
subtle difference between two rates.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-23 13:53                 ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-23 13:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:

[..]
> > > > So we refined the formula for calculating a tasks's effective rate
> > > > over a period of time to following.
> > > > 					    write_bw
> > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > 					    dirty_rate
> > > > 
> > > 
> > > That's not true. It should still be formula (7) when
> > > balance_drity_pages() considers pos_ratio.
> > 
> > Why it is not true? If I do some math, it sounds right. Let me summarize
> > my understanding again.
> 
> Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> 
> > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> >  
> >   dirty_bw/write_bw = 1  		(1)
> > 
> >   If we can achieve above then that means we are throttling tasks at
> >   just right rate.
> > 
> > Or
> > -  dirty_bw  == write_bw
> >    N * task_ratelimit == write_bw
> >    task_ratelimit =  write_bw/N         (2)
> > 
> >   So as long as we can come up with a system where balance_dirty_pages()
> >   calculates task_ratelimit to be write_bw/N, we should be fine.
> 
> Right.
> 
> > - But this does not take care of imbalances. So if system goes out of
> >   balance before feedback loop kicks in and dirty rate shoots up, then
> >   cache size will grow and number of dirty pages will shoot up. Hence
> >   we brought in the notion of position ratio where we also vary a 
> >   tasks's dirty ratelimit based on number of dirty pages. So our
> >   effective formula became.
> > 
> >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > 
> >   So as long as we meet (3), we should reach to stable state.
> 
> Right.
> 
> > -  But here N is unknown in advance so balance_drity_pages() can not make
> >    use of this formula directly. But write_bw and dirty_bw from previous
> >    200ms are known. So following can replace (3).
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > 					dirty_bw	
> > 
> >    dirty_bw = task_ratelimit_0 * N                (5)
> > 
> >    Substitute (5) in (4)
> > 
> >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > 
> >    (6) is same as (3) which has been derived from (4) and that means at any
> >    given point of time (4) can be used by balance_drity_pages() to calculate
> >    a tasks's throttling rate.
> 
> Right. Sorry what's in my mind was
> 
>                                        write_bw
>     balanced_rate = task_ratelimit_0 * --------
>                                        dirty_bw        
> 
>     task_ratelimit = balanced_rate * pos_ratio
> 
> which is effective the same to your combined equation (4).
> 
> > - Now going back to (4). Because we have a feedback loop where we
> >   continuously update a previous number based on feedback, we can track
> >   previous value in bdi->dirty_ratelimit.
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > 					dirty_bw	
> > 
> >    Or
> > 
> >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > 
> >    where
> > 					    write_bw	
> >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > 					    dirty_bw
> 
> Right.
> 
> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.
> 
> Right.
> 
> >   To summarize,
> > 
> >   We need to achieve (3) for a balanced system. Because we don't know the
> >   value of N in advance, we can use (4) to achieve effect of (3). So we
> >   start with a default value of task_ratelimit_0 and update that every
> >   200ms based on how write and dirty rate on device is changing (8). We also
> >   further refine that rate by pos_ratio so that any variations in number
> >   of dirty pages due to temporary imbalances in the system can be
> >   accounted for (7).
> > 
> > I see that you also use (7). I think only contention point is how
> > (8) is perceived. So can you please explain why do you think that
> > above calculation or (9) is wrong.
> 
> There is no contention point and (9) is right..Sorry it's my fault.
> We are well aligned in the above reasoning :)

Great. Now we are on same page now at least till this point.

> 
> > I can kind of understand that you have done various adjustments to keep the
> > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> 
> You mean the below chunk of code? Which is effectively the same as this _one_
> line of code
> 
>         bdi->dirty_ratelimit = balanced_rate;
> 
> except for doing some tricks (conditional update and limiting step size) to
> stabilize bdi->dirty_ratelimit:

I am fine with bdi->dirty_ratelimit being called balanced rate. I am
taking exception to the fact that you are also taking into accout
pos_ratio while coming up with new balanced_rate after 200ms of feedback.

We agreed to updating bdi->dirty_ratelimit as follows (8 above).

 
 						      write_bw
   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
 						      dirty_bw

I think in your terminology it could be called.
					   write_bw
  new_balanced_rate = prev_balanced_rate * ----------            (9)
					   dirty_bw

But what you seem to be doing is following.
							write_bw
  new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
							dirty_bw

Of course I have just tried to simlify your actual calculations to
show why I am questioning the presence of pos_ratio while calculating
the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.

So (9) and (10) don't match?

Now going back to your code and show how I arrived at (10).

executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
			dirty_rate | 1);			(12)

Combining (11) and (12) gives us (10).
				     write_bw
balance_rate = base_rate * pos_ratio --------
				     dirty_rate

Or
					    write_bw
bdi->dirty_ratelimit = base_rate * pos_ratio --------
					     dirty_rate

To complicate the things you also have the notion of pos_rate and reduce
the step size based on either pos_rate or balance_rate.

pos_rate = executed_rate = base_rate * pos_ratio;

				     write_bw
balance_rate = base_rate * pos_ratio --------
				     dirty_rate

bdi->dirty_rate_limit = min_change(pos_rate, balance_rate)       (13)

So for feedback, why are not sticking to simply (9) and limit the step
size and not take pos_ratio into account. 

Even if you have to take it into account, it needs to be explained clearly
and so many rate definitions confuse things more. Keeping name constant
everywhere (even for local variables), helps understand the code better.

Look at number of rates we have in code and it gets so confusing.

balanced_rate
base_rate
bdi->dirty_ratelimit

executed_rate
pos_rate
task_ratelimit

dirty_rate
write_bw

Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
referring to same thing and that is not obivious from the code. Looks
like task->ratelimit and executed_rate and pos_rate are referring to same
thing.

So instead of 6 rates, we could atleast collpase the naming to 2 rates
to keep the context clear. Just prefix/suffix more strings to highlight
subtle difference between two rates.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23 13:53                 ` Vivek Goyal
@ 2011-08-24  3:09                   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote:
> On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > > > So we refined the formula for calculating a tasks's effective rate
> > > > > over a period of time to following.
> > > > > 					    write_bw
> > > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > > 					    dirty_rate
> > > > > 
> > > > 
> > > > That's not true. It should still be formula (7) when
> > > > balance_drity_pages() considers pos_ratio.
> > > 
> > > Why it is not true? If I do some math, it sounds right. Let me summarize
> > > my understanding again.
> > 
> > Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> > 
> > > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> > >  
> > >   dirty_bw/write_bw = 1  		(1)
> > > 
> > >   If we can achieve above then that means we are throttling tasks at
> > >   just right rate.
> > > 
> > > Or
> > > -  dirty_bw  == write_bw
> > >    N * task_ratelimit == write_bw
> > >    task_ratelimit =  write_bw/N         (2)
> > > 
> > >   So as long as we can come up with a system where balance_dirty_pages()
> > >   calculates task_ratelimit to be write_bw/N, we should be fine.
> > 
> > Right.
> > 
> > > - But this does not take care of imbalances. So if system goes out of
> > >   balance before feedback loop kicks in and dirty rate shoots up, then
> > >   cache size will grow and number of dirty pages will shoot up. Hence
> > >   we brought in the notion of position ratio where we also vary a 
> > >   tasks's dirty ratelimit based on number of dirty pages. So our
> > >   effective formula became.
> > > 
> > >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > > 
> > >   So as long as we meet (3), we should reach to stable state.
> > 
> > Right.
> > 
> > > -  But here N is unknown in advance so balance_drity_pages() can not make
> > >    use of this formula directly. But write_bw and dirty_bw from previous
> > >    200ms are known. So following can replace (3).
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > > 					dirty_bw	
> > > 
> > >    dirty_bw = task_ratelimit_0 * N                (5)
> > > 
> > >    Substitute (5) in (4)
> > > 
> > >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > > 
> > >    (6) is same as (3) which has been derived from (4) and that means at any
> > >    given point of time (4) can be used by balance_drity_pages() to calculate
> > >    a tasks's throttling rate.
> > 
> > Right. Sorry what's in my mind was
> > 
> >                                        write_bw
> >     balanced_rate = task_ratelimit_0 * --------
> >                                        dirty_bw        
> > 
> >     task_ratelimit = balanced_rate * pos_ratio
> > 
> > which is effective the same to your combined equation (4).
> > 
> > > - Now going back to (4). Because we have a feedback loop where we
> > >   continuously update a previous number based on feedback, we can track
> > >   previous value in bdi->dirty_ratelimit.
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > > 					dirty_bw	
> > > 
> > >    Or
> > > 
> > >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > > 
> > >    where
> > > 					    write_bw	
> > >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > > 					    dirty_bw
> > 
> > Right.
> > 
> > >   Because task_ratelimit_0 is initial value to begin with and we will
> > >   keep on coming with new value every 200ms, we should be able to write
> > >   above as follows.
> > > 
> > > 						      write_bw
> > >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > > 						      dirty_bw
> > > 
> > >   Effectively we start with an initial value of task_ratelimit_0 and
> > >   then keep on updating it based on rate change feedback every 200ms.
> > 
> > Right.
> > 
> > >   To summarize,
> > > 
> > >   We need to achieve (3) for a balanced system. Because we don't know the
> > >   value of N in advance, we can use (4) to achieve effect of (3). So we
> > >   start with a default value of task_ratelimit_0 and update that every
> > >   200ms based on how write and dirty rate on device is changing (8). We also
> > >   further refine that rate by pos_ratio so that any variations in number
> > >   of dirty pages due to temporary imbalances in the system can be
> > >   accounted for (7).
> > > 
> > > I see that you also use (7). I think only contention point is how
> > > (8) is perceived. So can you please explain why do you think that
> > > above calculation or (9) is wrong.
> > 
> > There is no contention point and (9) is right..Sorry it's my fault.
> > We are well aligned in the above reasoning :)
> 
> Great. Now we are on same page now at least till this point.
> 
> > 
> > > I can kind of understand that you have done various adjustments to keep the
> > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> > 
> > You mean the below chunk of code? Which is effectively the same as this _one_
> > line of code
> > 
> >         bdi->dirty_ratelimit = balanced_rate;
> > 
> > except for doing some tricks (conditional update and limiting step size) to
> > stabilize bdi->dirty_ratelimit:
> 
> I am fine with bdi->dirty_ratelimit being called balanced rate. I am
> taking exception to the fact that you are also taking into accout
> pos_ratio while coming up with new balanced_rate after 200ms of feedback.
> 
> We agreed to updating bdi->dirty_ratelimit as follows (8 above).
> 
>  
>  						      write_bw
>    bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
>  						      dirty_bw
> 
> I think in your terminology it could be called.
> 					   write_bw
>   new_balanced_rate = prev_balanced_rate * ----------            (9)
> 					   dirty_bw
> 
> But what you seem to be doing is following.
> 							write_bw
>   new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
> 							dirty_bw
> 
> Of course I have just tried to simlify your actual calculations to
> show why I am questioning the presence of pos_ratio while calculating
> the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.
> 
> So (9) and (10) don't match?
> 
> Now going back to your code and show how I arrived at (10).
> 
> executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
> balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
> 			dirty_rate | 1);			(12)
> 
> Combining (11) and (12) gives us (10).
> 				     write_bw
> balance_rate = base_rate * pos_ratio --------
> 				     dirty_rate
> 
> Or
> 					    write_bw
> bdi->dirty_ratelimit = base_rate * pos_ratio --------
> 					     dirty_rate

I hope the other email on the balanced_rate estimation equation can
clarify the questions on pos_ratio..

> To complicate the things you also have the notion of pos_rate and reduce
> the step size based on either pos_rate or balance_rate.
> 
> pos_rate = executed_rate = base_rate * pos_ratio;
> 
> 				     write_bw
> balance_rate = base_rate * pos_ratio --------
> 				     dirty_rate
> 
> bdi->dirty_rate_limit = min_change(pos_rate, balance_rate)       (13)
> 
> So for feedback, why are not sticking to simply (9) and limit the step
> size and not take pos_ratio into account. 

pos_rate is used to limit the step size. This reply to Peter has more
details:

http://www.spinics.net/lists/linux-fsdevel/msg47991.html

> Even if you have to take it into account, it needs to be explained clearly
> and so many rate definitions confuse things more. Keeping name constant
> everywhere (even for local variables), helps understand the code better.
> 

Good idea! There are two many names that differs subtly..

> Look at number of rates we have in code and it gets so confusing.
> 
> balanced_rate
> base_rate
> bdi->dirty_ratelimit
> 
> executed_rate
> pos_rate
> task_ratelimit
> 
> dirty_rate
> write_bw
> 
> Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
> referring to same thing and that is not obivious from the code. Looks
> like task->ratelimit and executed_rate and pos_rate are referring to same
> thing.

Right.

> So instead of 6 rates, we could atleast collpase the naming to 2 rates
> to keep the context clear. Just prefix/suffix more strings to highlight
> subtle difference between two rates.

How about

  balanced_rate            =>  balanced_dirty_ratelimit
  base_rate                =>  dirty_ratelimit
  bdi->dirty_ratelimit     ==  bdi->dirty_ratelimit

  pos_rate                 =>  task_ratelimit
  executed_rate            =>  task_ratelimit
  task_ratelimit           ==  task_ratelimit

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-24  3:09                   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote:
> On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > > > So we refined the formula for calculating a tasks's effective rate
> > > > > over a period of time to following.
> > > > > 					    write_bw
> > > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > > 					    dirty_rate
> > > > > 
> > > > 
> > > > That's not true. It should still be formula (7) when
> > > > balance_drity_pages() considers pos_ratio.
> > > 
> > > Why it is not true? If I do some math, it sounds right. Let me summarize
> > > my understanding again.
> > 
> > Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> > 
> > > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> > >  
> > >   dirty_bw/write_bw = 1  		(1)
> > > 
> > >   If we can achieve above then that means we are throttling tasks at
> > >   just right rate.
> > > 
> > > Or
> > > -  dirty_bw  == write_bw
> > >    N * task_ratelimit == write_bw
> > >    task_ratelimit =  write_bw/N         (2)
> > > 
> > >   So as long as we can come up with a system where balance_dirty_pages()
> > >   calculates task_ratelimit to be write_bw/N, we should be fine.
> > 
> > Right.
> > 
> > > - But this does not take care of imbalances. So if system goes out of
> > >   balance before feedback loop kicks in and dirty rate shoots up, then
> > >   cache size will grow and number of dirty pages will shoot up. Hence
> > >   we brought in the notion of position ratio where we also vary a 
> > >   tasks's dirty ratelimit based on number of dirty pages. So our
> > >   effective formula became.
> > > 
> > >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > > 
> > >   So as long as we meet (3), we should reach to stable state.
> > 
> > Right.
> > 
> > > -  But here N is unknown in advance so balance_drity_pages() can not make
> > >    use of this formula directly. But write_bw and dirty_bw from previous
> > >    200ms are known. So following can replace (3).
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > > 					dirty_bw	
> > > 
> > >    dirty_bw = task_ratelimit_0 * N                (5)
> > > 
> > >    Substitute (5) in (4)
> > > 
> > >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > > 
> > >    (6) is same as (3) which has been derived from (4) and that means at any
> > >    given point of time (4) can be used by balance_drity_pages() to calculate
> > >    a tasks's throttling rate.
> > 
> > Right. Sorry what's in my mind was
> > 
> >                                        write_bw
> >     balanced_rate = task_ratelimit_0 * --------
> >                                        dirty_bw        
> > 
> >     task_ratelimit = balanced_rate * pos_ratio
> > 
> > which is effective the same to your combined equation (4).
> > 
> > > - Now going back to (4). Because we have a feedback loop where we
> > >   continuously update a previous number based on feedback, we can track
> > >   previous value in bdi->dirty_ratelimit.
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > > 					dirty_bw	
> > > 
> > >    Or
> > > 
> > >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > > 
> > >    where
> > > 					    write_bw	
> > >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > > 					    dirty_bw
> > 
> > Right.
> > 
> > >   Because task_ratelimit_0 is initial value to begin with and we will
> > >   keep on coming with new value every 200ms, we should be able to write
> > >   above as follows.
> > > 
> > > 						      write_bw
> > >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > > 						      dirty_bw
> > > 
> > >   Effectively we start with an initial value of task_ratelimit_0 and
> > >   then keep on updating it based on rate change feedback every 200ms.
> > 
> > Right.
> > 
> > >   To summarize,
> > > 
> > >   We need to achieve (3) for a balanced system. Because we don't know the
> > >   value of N in advance, we can use (4) to achieve effect of (3). So we
> > >   start with a default value of task_ratelimit_0 and update that every
> > >   200ms based on how write and dirty rate on device is changing (8). We also
> > >   further refine that rate by pos_ratio so that any variations in number
> > >   of dirty pages due to temporary imbalances in the system can be
> > >   accounted for (7).
> > > 
> > > I see that you also use (7). I think only contention point is how
> > > (8) is perceived. So can you please explain why do you think that
> > > above calculation or (9) is wrong.
> > 
> > There is no contention point and (9) is right..Sorry it's my fault.
> > We are well aligned in the above reasoning :)
> 
> Great. Now we are on same page now at least till this point.
> 
> > 
> > > I can kind of understand that you have done various adjustments to keep the
> > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> > 
> > You mean the below chunk of code? Which is effectively the same as this _one_
> > line of code
> > 
> >         bdi->dirty_ratelimit = balanced_rate;
> > 
> > except for doing some tricks (conditional update and limiting step size) to
> > stabilize bdi->dirty_ratelimit:
> 
> I am fine with bdi->dirty_ratelimit being called balanced rate. I am
> taking exception to the fact that you are also taking into accout
> pos_ratio while coming up with new balanced_rate after 200ms of feedback.
> 
> We agreed to updating bdi->dirty_ratelimit as follows (8 above).
> 
>  
>  						      write_bw
>    bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
>  						      dirty_bw
> 
> I think in your terminology it could be called.
> 					   write_bw
>   new_balanced_rate = prev_balanced_rate * ----------            (9)
> 					   dirty_bw
> 
> But what you seem to be doing is following.
> 							write_bw
>   new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
> 							dirty_bw
> 
> Of course I have just tried to simlify your actual calculations to
> show why I am questioning the presence of pos_ratio while calculating
> the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.
> 
> So (9) and (10) don't match?
> 
> Now going back to your code and show how I arrived at (10).
> 
> executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
> balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
> 			dirty_rate | 1);			(12)
> 
> Combining (11) and (12) gives us (10).
> 				     write_bw
> balance_rate = base_rate * pos_ratio --------
> 				     dirty_rate
> 
> Or
> 					    write_bw
> bdi->dirty_ratelimit = base_rate * pos_ratio --------
> 					     dirty_rate

I hope the other email on the balanced_rate estimation equation can
clarify the questions on pos_ratio..

> To complicate the things you also have the notion of pos_rate and reduce
> the step size based on either pos_rate or balance_rate.
> 
> pos_rate = executed_rate = base_rate * pos_ratio;
> 
> 				     write_bw
> balance_rate = base_rate * pos_ratio --------
> 				     dirty_rate
> 
> bdi->dirty_rate_limit = min_change(pos_rate, balance_rate)       (13)
> 
> So for feedback, why are not sticking to simply (9) and limit the step
> size and not take pos_ratio into account. 

pos_rate is used to limit the step size. This reply to Peter has more
details:

http://www.spinics.net/lists/linux-fsdevel/msg47991.html

> Even if you have to take it into account, it needs to be explained clearly
> and so many rate definitions confuse things more. Keeping name constant
> everywhere (even for local variables), helps understand the code better.
> 

Good idea! There are two many names that differs subtly..

> Look at number of rates we have in code and it gets so confusing.
> 
> balanced_rate
> base_rate
> bdi->dirty_ratelimit
> 
> executed_rate
> pos_rate
> task_ratelimit
> 
> dirty_rate
> write_bw
> 
> Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
> referring to same thing and that is not obivious from the code. Looks
> like task->ratelimit and executed_rate and pos_rate are referring to same
> thing.

Right.

> So instead of 6 rates, we could atleast collpase the naming to 2 rates
> to keep the context clear. Just prefix/suffix more strings to highlight
> subtle difference between two rates.

How about

  balanced_rate            =>  balanced_dirty_ratelimit
  base_rate                =>  dirty_ratelimit
  bdi->dirty_ratelimit     ==  bdi->dirty_ratelimit

  pos_rate                 =>  task_ratelimit
  executed_rate            =>  task_ratelimit
  task_ratelimit           ==  task_ratelimit

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-24  3:16           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
> span?

OK, I'll follow your suggestion to use

        span = 8 * write_bw, for single bdi case 
        span = bdi_thresh, for JBOD case
        x_intercept = setpoint + span;

It does make sense to squeeze the bdi_dirty fluctuation range a bit by
doubling span and making the control line more sharp.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-24  3:16           ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
> span?

OK, I'll follow your suggestion to use

        span = 8 * write_bw, for single bdi case 
        span = bdi_thresh, for JBOD case
        x_intercept = setpoint + span;

It does make sense to squeeze the bdi_dirty fluctuation range a bit by
doubling span and making the control line more sharp.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 17:10           ` Peter Zijlstra
@ 2011-08-15 14:11             ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 01:10:26AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> > 
> > > Although I'm not quite sure how he keeps fairness in light of the
> > > sleep time bounding to MAX_PAUSE.
> > 
> > Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> > high (dirty exceeded).  Secondly, the dirty exceeded state is global
> > to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> > So the fairness is still maintained in dirty exceeded state. 
> 
> Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
> but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
> its only fair if they dirty at the same rate.

Yeah I forget to mention that, but when dirty_exceeded, the tasks will
typically sleep for MAX_PAUSE on every 8 pages, so resulting in the
same dirty rate :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-15 14:11             ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 01:10:26AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> > 
> > > Although I'm not quite sure how he keeps fairness in light of the
> > > sleep time bounding to MAX_PAUSE.
> > 
> > Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> > high (dirty exceeded).  Secondly, the dirty exceeded state is global
> > to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> > So the fairness is still maintained in dirty exceeded state. 
> 
> Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
> but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
> its only fair if they dirty at the same rate.

Yeah I forget to mention that, but when dirty_exceeded, the tasks will
typically sleep for MAX_PAUSE on every 8 pages, so resulting in the
same dirty rate :)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 16:17         ` Peter Zijlstra
@ 2011-08-15 14:08           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 12:17:55AM +0800, Peter Zijlstra wrote:
> How about something like the below, it still needs some more work, but
> its more or less complete in that is now explains both controls in one
> story. The actual update bit is still missing.

Looks pretty good, thanks!  I'll post the completed version at the
bottom.

> ---
> 
> balance_dirty_pages() needs to throttle tasks dirtying pages such that
> the total amount of dirty pages stays below the specified dirty limit in
> order to avoid memory deadlocks. Furthermore we desire fairness in that
> tasks get throttled proportionally to the amount of pages they dirty.
> 
> IOW we want to throttle tasks such that we match the dirty rate to the
> writeout bandwidth, this yields a stable amount of dirty pages:
> 
> 	ratelimit = writeout_bandwidth
> 
> The fairness requirements gives us:
> 
> 	task_ratelimit = write_bandwidth / N
> 
> > : When started N dd, we would like to throttle each dd at
> > : 
> > :          balanced_rate == write_bw / N                                  (1)
> > : 
> > : We don't know N beforehand, but still can estimate balanced_rate
> > : within 200ms.
> > : 
> > : Start by throttling each dd task at rate
> > : 
> > :         task_ratelimit = task_ratelimit_0                               (2)
> > :                          (any non-zero initial value is OK)
> > : 
> > : After 200ms, we got
> > : 
> > :         dirty_rate = # of pages dirtied by all dd's / 200ms
> > :         write_bw   = # of pages written to the disk / 200ms
> > : 
> > : For the aggressive dd dirtiers, the equality holds
> > : 
> > :         dirty_rate == N * task_rate
> > :                    == N * task_ratelimit
> > :                    == N * task_ratelimit_0                              (3)
> > : Or
> > :         task_ratelimit_0 = dirty_rate / N                               (4)
> > :                           
> > : So the balanced throttle bandwidth can be estimated by
> > :                           
> > :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> > :                           
> > : Because with (4) and (5) we can get the desired equality (1):
> > :                           
> > :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > :                       == write_bw / N
> 
> Then using the balance_rate we can compute task pause times like:
> 
> 	task_pause = task->nr_dirtied / task_ratelimit
> 
> [ however all that still misses the primary feedback of:
> 
>    task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)
> 
>   there's still some confusion in the above due to task_ratelimit and
>   balanced_rate.
> ]
> 
> However, while the above gives us means of matching the dirty rate to
> the writeout bandwidth, it at best provides us with a stable dirty page
> count (assuming a static system). In order to control the dirty page
> count such that it is high enough to provide performance, but does not
> exceed the specified limit we need another control.
> 
> > So if the dirty pages are ABOVE the setpoints, we throttle each task
> > a bit more HEAVY than balanced_rate, so that the dirty pages are
> > created less fast than they are cleaned, thus DROP to the setpoints
> > (and the reverse). With that positional adjustment, the formula is
> > transformed from
> > 
> >         task_ratelimit = balanced_rate
> > 
> > to
> > 
> >         task_ratelimit = balanced_rate * pos_ratio
> 
> > In terms of the negative feedback control theory, the
> > bdi_position_ratio() function (control lines) can be expressed as
> > 
> > 1) f(setpoint) = 1.0
> > 2) df/dt < 0
> > 
> > 3) optionally, abs(df/dt) should be large on large errors (= dirty -
> >    setpoint) in order to cancel the errors fast, and be smaller when
> >    dirty pages get closer to the setpoints in order to avoid overshooting.
> 
> 

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:
        
        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) to

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extend (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that subjects to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged in
the past 200ms (very small comparing to the 3s estimation period for
write_bw), which makes rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_rate. The truncates, due to its possibly bumpy
nature, can hardly be compensated smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematical errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point to bring up dirty_ratelimit in a hurry only to hurt both the
above two goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_rate. pos_rate also has the nice smaller errors in stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-15 14:08           ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 12:17:55AM +0800, Peter Zijlstra wrote:
> How about something like the below, it still needs some more work, but
> its more or less complete in that is now explains both controls in one
> story. The actual update bit is still missing.

Looks pretty good, thanks!  I'll post the completed version at the
bottom.

> ---
> 
> balance_dirty_pages() needs to throttle tasks dirtying pages such that
> the total amount of dirty pages stays below the specified dirty limit in
> order to avoid memory deadlocks. Furthermore we desire fairness in that
> tasks get throttled proportionally to the amount of pages they dirty.
> 
> IOW we want to throttle tasks such that we match the dirty rate to the
> writeout bandwidth, this yields a stable amount of dirty pages:
> 
> 	ratelimit = writeout_bandwidth
> 
> The fairness requirements gives us:
> 
> 	task_ratelimit = write_bandwidth / N
> 
> > : When started N dd, we would like to throttle each dd at
> > : 
> > :          balanced_rate == write_bw / N                                  (1)
> > : 
> > : We don't know N beforehand, but still can estimate balanced_rate
> > : within 200ms.
> > : 
> > : Start by throttling each dd task at rate
> > : 
> > :         task_ratelimit = task_ratelimit_0                               (2)
> > :                          (any non-zero initial value is OK)
> > : 
> > : After 200ms, we got
> > : 
> > :         dirty_rate = # of pages dirtied by all dd's / 200ms
> > :         write_bw   = # of pages written to the disk / 200ms
> > : 
> > : For the aggressive dd dirtiers, the equality holds
> > : 
> > :         dirty_rate == N * task_rate
> > :                    == N * task_ratelimit
> > :                    == N * task_ratelimit_0                              (3)
> > : Or
> > :         task_ratelimit_0 = dirty_rate / N                               (4)
> > :                           
> > : So the balanced throttle bandwidth can be estimated by
> > :                           
> > :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> > :                           
> > : Because with (4) and (5) we can get the desired equality (1):
> > :                           
> > :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > :                       == write_bw / N
> 
> Then using the balance_rate we can compute task pause times like:
> 
> 	task_pause = task->nr_dirtied / task_ratelimit
> 
> [ however all that still misses the primary feedback of:
> 
>    task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)
> 
>   there's still some confusion in the above due to task_ratelimit and
>   balanced_rate.
> ]
> 
> However, while the above gives us means of matching the dirty rate to
> the writeout bandwidth, it at best provides us with a stable dirty page
> count (assuming a static system). In order to control the dirty page
> count such that it is high enough to provide performance, but does not
> exceed the specified limit we need another control.
> 
> > So if the dirty pages are ABOVE the setpoints, we throttle each task
> > a bit more HEAVY than balanced_rate, so that the dirty pages are
> > created less fast than they are cleaned, thus DROP to the setpoints
> > (and the reverse). With that positional adjustment, the formula is
> > transformed from
> > 
> >         task_ratelimit = balanced_rate
> > 
> > to
> > 
> >         task_ratelimit = balanced_rate * pos_ratio
> 
> > In terms of the negative feedback control theory, the
> > bdi_position_ratio() function (control lines) can be expressed as
> > 
> > 1) f(setpoint) = 1.0
> > 2) df/dt < 0
> > 
> > 3) optionally, abs(df/dt) should be large on large errors (= dirty -
> >    setpoint) in order to cancel the errors fast, and be smaller when
> >    dirty pages get closer to the setpoints in order to avoid overshooting.
> 
> 

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:
        
        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) to

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extend (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that subjects to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged in
the past 200ms (very small comparing to the 3s estimation period for
write_bw), which makes rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_rate. The truncates, due to its possibly bumpy
nature, can hardly be compensated smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematical errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point to bring up dirty_ratelimit in a hurry only to hurt both the
above two goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_rate. pos_rate also has the nice smaller errors in stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 14:54     ` Vivek Goyal
@ 2011-08-11  3:42       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-11  3:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Tue, Aug 09, 2011 at 10:54:38PM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> > It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> > when there are N dd tasks.
> > 
> > On write() syscall, use bdi->dirty_ratelimit
> > ============================================
> > 
> >     balance_dirty_pages(pages_dirtied)
> >     {
> >         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
> >         pause = pages_dirtied / pos_bw;
> >         sleep(pause);
> >     }
> > 
> > On every 200ms, update bdi->dirty_ratelimit
> > ===========================================
> > 
> >     bdi_update_dirty_ratelimit()
> >     {
> >         bw = bdi->dirty_ratelimit;
> >         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
> >         if (dirty pages unbalanced)
> >              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
> >     }
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> > 
> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> > 
> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                      	(1)
> > 
> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> > 
> > >From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                      	(3)
> > 
> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Hi Fengguang,

Hi Vivek,

> So how much work it is to extend all this to handle the case of cgroups?

Here is the simplest form.

writeback: async write IO controllers
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=blobdiff;f=mm/page-writeback.c;h=0b579e7fd338fd1f59cc36bf15fda06ff6260634;hp=34dff9f0d28d0f4f0794eb41187f71b4ade6b8a2;hb=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hpb=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

And an old email on this topic:

https://lkml.org/lkml/2011/4/28/229

> IOW, I would imagine that you shall have to keep track of per cgroup/per
> bdi state of many of the variables. For example, write_bw will become
> per cgroup/per bdi entity instead of per bdi entity only. Same should
> be true for position ratio, dirty_bw etc?
 
The dirty_bw, write_bw and dirty_ratelimit should be replicated,
but not necessarily dirty pages and position ratio.

The cgroup can just rely on the root cgroup's dirty pages position
control if it does not care about its own dirty pages consumptions.

> I am assuming that if some cgroup is low weight on end device, then
> WRITE bandwidth of that cgroup should go down and that should be
> accounted for at per bdi state and task throttling should happen
> accordingly so that a lower weight cgroup tasks get throttled more
> as compared to higher weight cgroup tasks?

Sorry I don't quite catch your meaning, but the current
->dirty_ratelimit adaptation scheme (detailed in another email) should
handle all such rate/bw allocation issues automatically?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-11  3:42       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-11  3:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Tue, Aug 09, 2011 at 10:54:38PM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> > It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> > when there are N dd tasks.
> > 
> > On write() syscall, use bdi->dirty_ratelimit
> > ============================================
> > 
> >     balance_dirty_pages(pages_dirtied)
> >     {
> >         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
> >         pause = pages_dirtied / pos_bw;
> >         sleep(pause);
> >     }
> > 
> > On every 200ms, update bdi->dirty_ratelimit
> > ===========================================
> > 
> >     bdi_update_dirty_ratelimit()
> >     {
> >         bw = bdi->dirty_ratelimit;
> >         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
> >         if (dirty pages unbalanced)
> >              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
> >     }
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> > 
> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> > 
> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                      	(1)
> > 
> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> > 
> > >From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                      	(3)
> > 
> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Hi Fengguang,

Hi Vivek,

> So how much work it is to extend all this to handle the case of cgroups?

Here is the simplest form.

writeback: async write IO controllers
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=blobdiff;f=mm/page-writeback.c;h=0b579e7fd338fd1f59cc36bf15fda06ff6260634;hp=34dff9f0d28d0f4f0794eb41187f71b4ade6b8a2;hb=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hpb=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

And an old email on this topic:

https://lkml.org/lkml/2011/4/28/229

> IOW, I would imagine that you shall have to keep track of per cgroup/per
> bdi state of many of the variables. For example, write_bw will become
> per cgroup/per bdi entity instead of per bdi entity only. Same should
> be true for position ratio, dirty_bw etc?
 
The dirty_bw, write_bw and dirty_ratelimit should be replicated,
but not necessarily dirty pages and position ratio.

The cgroup can just rely on the root cgroup's dirty pages position
control if it does not care about its own dirty pages consumptions.

> I am assuming that if some cgroup is low weight on end device, then
> WRITE bandwidth of that cgroup should go down and that should be
> accounted for at per bdi state and task throttling should happen
> accordingly so that a lower weight cgroup tasks get throttled more
> as compared to higher weight cgroup tasks?

Sorry I don't quite catch your meaning, but the current
->dirty_ratelimit adaptation scheme (detailed in another email) should
handle all such rate/bw allocation issues automatically?

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 14:00         ` Wu Fengguang
@ 2011-08-10 17:10           ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-10 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> 
> > Although I'm not quite sure how he keeps fairness in light of the
> > sleep time bounding to MAX_PAUSE.
> 
> Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> high (dirty exceeded).  Secondly, the dirty exceeded state is global
> to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> So the fairness is still maintained in dirty exceeded state. 

Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
its only fair if they dirty at the same rate.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 17:10           ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-10 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> 
> > Although I'm not quite sure how he keeps fairness in light of the
> > sleep time bounding to MAX_PAUSE.
> 
> Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> high (dirty exceeded).  Secondly, the dirty exceeded state is global
> to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> So the fairness is still maintained in dirty exceeded state. 

Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
its only fair if they dirty at the same rate.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 11:07       ` Wu Fengguang
  (?)
@ 2011-08-10 16:17         ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-10 16:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

How about something like the below, it still needs some more work, but
its more or less complete in that is now explains both controls in one
story. The actual update bit is still missing.

---

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = writeout_bandwidth

The fairness requirements gives us:

	task_ratelimit = write_bandwidth / N

> : When started N dd, we would like to throttle each dd at
> : 
> :          balanced_rate == write_bw / N                                  (1)
> : 
> : We don't know N beforehand, but still can estimate balanced_rate
> : within 200ms.
> : 
> : Start by throttling each dd task at rate
> : 
> :         task_ratelimit = task_ratelimit_0                               (2)
> :                          (any non-zero initial value is OK)
> : 
> : After 200ms, we got
> : 
> :         dirty_rate = # of pages dirtied by all dd's / 200ms
> :         write_bw   = # of pages written to the disk / 200ms
> : 
> : For the aggressive dd dirtiers, the equality holds
> : 
> :         dirty_rate == N * task_rate
> :                    == N * task_ratelimit
> :                    == N * task_ratelimit_0                              (3)
> : Or
> :         task_ratelimit_0 = dirty_rate / N                               (4)
> :                           
> : So the balanced throttle bandwidth can be estimated by
> :                           
> :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> :                           
> : Because with (4) and (5) we can get the desired equality (1):
> :                           
> :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> :                       == write_bw / N

Then using the balance_rate we can compute task pause times like:

	task_pause = task->nr_dirtied / task_ratelimit

[ however all that still misses the primary feedback of:

   task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)

  there's still some confusion in the above due to task_ratelimit and
  balanced_rate.
]

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

> So if the dirty pages are ABOVE the setpoints, we throttle each task
> a bit more HEAVY than balanced_rate, so that the dirty pages are
> created less fast than they are cleaned, thus DROP to the setpoints
> (and the reverse). With that positional adjustment, the formula is
> transformed from
> 
>         task_ratelimit = balanced_rate
> 
> to
> 
>         task_ratelimit = balanced_rate * pos_ratio

> In terms of the negative feedback control theory, the
> bdi_position_ratio() function (control lines) can be expressed as
> 
> 1) f(setpoint) = 1.0
> 2) df/dt < 0
> 
> 3) optionally, abs(df/dt) should be large on large errors (= dirty -
>    setpoint) in order to cancel the errors fast, and be smaller when
>    dirty pages get closer to the setpoints in order to avoid overshooting.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 16:17         ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-10 16:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

How about something like the below, it still needs some more work, but
its more or less complete in that is now explains both controls in one
story. The actual update bit is still missing.

---

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = writeout_bandwidth

The fairness requirements gives us:

	task_ratelimit = write_bandwidth / N

> : When started N dd, we would like to throttle each dd at
> : 
> :          balanced_rate == write_bw / N                                  (1)
> : 
> : We don't know N beforehand, but still can estimate balanced_rate
> : within 200ms.
> : 
> : Start by throttling each dd task at rate
> : 
> :         task_ratelimit = task_ratelimit_0                               (2)
> :                          (any non-zero initial value is OK)
> : 
> : After 200ms, we got
> : 
> :         dirty_rate = # of pages dirtied by all dd's / 200ms
> :         write_bw   = # of pages written to the disk / 200ms
> : 
> : For the aggressive dd dirtiers, the equality holds
> : 
> :         dirty_rate == N * task_rate
> :                    == N * task_ratelimit
> :                    == N * task_ratelimit_0                              (3)
> : Or
> :         task_ratelimit_0 = dirty_rate / N                               (4)
> :                           
> : So the balanced throttle bandwidth can be estimated by
> :                           
> :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> :                           
> : Because with (4) and (5) we can get the desired equality (1):
> :                           
> :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> :                       == write_bw / N

Then using the balance_rate we can compute task pause times like:

	task_pause = task->nr_dirtied / task_ratelimit

[ however all that still misses the primary feedback of:

   task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)

  there's still some confusion in the above due to task_ratelimit and
  balanced_rate.
]

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

> So if the dirty pages are ABOVE the setpoints, we throttle each task
> a bit more HEAVY than balanced_rate, so that the dirty pages are
> created less fast than they are cleaned, thus DROP to the setpoints
> (and the reverse). With that positional adjustment, the formula is
> transformed from
> 
>         task_ratelimit = balanced_rate
> 
> to
> 
>         task_ratelimit = balanced_rate * pos_ratio

> In terms of the negative feedback control theory, the
> bdi_position_ratio() function (control lines) can be expressed as
> 
> 1) f(setpoint) = 1.0
> 2) df/dt < 0
> 
> 3) optionally, abs(df/dt) should be large on large errors (= dirty -
>    setpoint) in order to cancel the errors fast, and be smaller when
>    dirty pages get closer to the setpoints in order to avoid overshooting.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 16:17         ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-10 16:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

How about something like the below, it still needs some more work, but
its more or less complete in that is now explains both controls in one
story. The actual update bit is still missing.

---

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:

	ratelimit = writeout_bandwidth

The fairness requirements gives us:

	task_ratelimit = write_bandwidth / N

> : When started N dd, we would like to throttle each dd at
> : 
> :          balanced_rate == write_bw / N                                  (1)
> : 
> : We don't know N beforehand, but still can estimate balanced_rate
> : within 200ms.
> : 
> : Start by throttling each dd task at rate
> : 
> :         task_ratelimit = task_ratelimit_0                               (2)
> :                          (any non-zero initial value is OK)
> : 
> : After 200ms, we got
> : 
> :         dirty_rate = # of pages dirtied by all dd's / 200ms
> :         write_bw   = # of pages written to the disk / 200ms
> : 
> : For the aggressive dd dirtiers, the equality holds
> : 
> :         dirty_rate == N * task_rate
> :                    == N * task_ratelimit
> :                    == N * task_ratelimit_0                              (3)
> : Or
> :         task_ratelimit_0 = dirty_rate / N                               (4)
> :                           
> : So the balanced throttle bandwidth can be estimated by
> :                           
> :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> :                           
> : Because with (4) and (5) we can get the desired equality (1):
> :                           
> :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> :                       == write_bw / N

Then using the balance_rate we can compute task pause times like:

	task_pause = task->nr_dirtied / task_ratelimit

[ however all that still misses the primary feedback of:

   task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)

  there's still some confusion in the above due to task_ratelimit and
  balanced_rate.
]

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

> So if the dirty pages are ABOVE the setpoints, we throttle each task
> a bit more HEAVY than balanced_rate, so that the dirty pages are
> created less fast than they are cleaned, thus DROP to the setpoints
> (and the reverse). With that positional adjustment, the formula is
> transformed from
> 
>         task_ratelimit = balanced_rate
> 
> to
> 
>         task_ratelimit = balanced_rate * pos_ratio

> In terms of the negative feedback control theory, the
> bdi_position_ratio() function (control lines) can be expressed as
> 
> 1) f(setpoint) = 1.0
> 2) df/dt < 0
> 
> 3) optionally, abs(df/dt) should be large on large errors (= dirty -
>    setpoint) in order to cancel the errors fast, and be smaller when
>    dirty pages get closer to the setpoints in order to avoid overshooting.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 17:02     ` Peter Zijlstra
@ 2011-08-10 14:15       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:02:02AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> > +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> > +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> > +
> 
> > +       pos_ratio *= bdi->avg_write_bandwidth;
> > +       do_div(pos_ratio, dirty_bw | 1);
> > +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 
> 
> when written out that results in:
> 
>            bw * pos_ratio * bdi->avg_write_bandwidth
>   ref_bw = -----------------------------------------
>                          dirty_bw
> 
> which would suggest you write it like:
> 
>   ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);
> 
> since pos_bw is already bw * pos_ratio per the above.

Good point. Oopse I even wrote a comment for the over complex calculation:

         * balanced_rate = pos_rate * write_bw / dirty_rate

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 14:15       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:02:02AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> > +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> > +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> > +
> 
> > +       pos_ratio *= bdi->avg_write_bandwidth;
> > +       do_div(pos_ratio, dirty_bw | 1);
> > +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 
> 
> when written out that results in:
> 
>            bw * pos_ratio * bdi->avg_write_bandwidth
>   ref_bw = -----------------------------------------
>                          dirty_bw
> 
> which would suggest you write it like:
> 
>   ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);
> 
> since pos_bw is already bw * pos_ratio per the above.

Good point. Oopse I even wrote a comment for the over complex calculation:

         * balanced_rate = pos_rate * write_bw / dirty_rate

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:56     ` Peter Zijlstra
  (?)
  (?)
@ 2011-08-10 14:10     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 600 bytes --]

On Wed, Aug 10, 2011 at 12:56:56AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> >              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
> 
> I can't actually find this low-pass filter in the code.. could be I'm
> blind from staring at it too long though..

Sorry, it's implemented in another patch (attached). I've also removed
it from _this_ changelog.

Here you can find all the other patches in addition to the core bits.

http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v8%2B

Thanks,
Fengguang

[-- Attachment #2: smooth-base-bw --]
[-- Type: text/plain, Size: 2488 bytes --]

Subject: writeback: make dirty_ratelimit stable/smooth
Date: Thu Aug 04 22:05:05 CST 2011

Half the dirty_ratelimit update step size to avoid overshooting, and
further slow down the updates when the tracking error is smaller than
(base_rate / 8).

It's desirable to have a _constant_ dirty_ratelimit given a stable
workload. Because each jolt of dirty_ratelimit will directly show up
in all the bdi tasks' dirty rate.

The cost will be slightly increased dirty position error, which is
pretty acceptable.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-10 21:35:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-10 21:35:31.000000000 +0800
@@ -741,6 +741,7 @@ static void bdi_update_dirty_ratelimit(s
 	unsigned long dirty_rate;
 	unsigned long pos_rate;
 	unsigned long balanced_rate;
+	unsigned long delta;
 	unsigned long long pos_ratio;
 
 	/*
@@ -755,7 +756,6 @@ static void bdi_update_dirty_ratelimit(s
 	 * pos_rate reflects each dd's dirty rate enforced for the past 200ms.
 	 */
 	pos_rate = base_rate * pos_ratio >> BANDWIDTH_CALC_SHIFT;
-	pos_rate++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
 
 	/*
 	 * balanced_rate = pos_rate * write_bw / dirty_rate
@@ -777,14 +777,32 @@ static void bdi_update_dirty_ratelimit(s
 	 * makes it more stable, but also is essential for preventing it being
 	 * driven away by possible systematic errors in balanced_rate.
 	 */
+	delta = 0;
 	if (base_rate > pos_rate) {
 		if (base_rate > balanced_rate)
-			base_rate = max(balanced_rate, pos_rate);
+			delta = base_rate - max(balanced_rate, pos_rate);
 	} else {
 		if (base_rate < balanced_rate)
-			base_rate = min(balanced_rate, pos_rate);
+			delta = min(balanced_rate, pos_rate) - base_rate;
 	}
 
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Eliminates unnecessary jolting.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the step size to avoid overshooting. It also implicitly
+	 * prevents dirty_ratelimit from dropping to 0.
+	 */
+	delta >>= 2;
+
+	if (base_rate < pos_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
 	bdi->dirty_ratelimit = base_rate;
 
 	trace_dirty_ratelimit(bdi, dirty_rate, pos_rate, balanced_rate);

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:19         ` Peter Zijlstra
@ 2011-08-10 14:07           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:19:32AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > > 
> > > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > > to be overall bdi limit and does not seem to take into account the
> > > number of tasks doing IO to that bdi (as your comment suggests). So
> > > it probably will track write_bw as opposed to write_bw/N. What am
> > > I missing? 
> > 
> > I think the per task thing comes from him using the pages_dirtied
> > argument to balance_dirty_pages() to compute the sleep time. Although
> > I'm not quite sure how he keeps fairness in light of the sleep time
> > bounding to MAX_PAUSE.
> 
> Furthermore, there's of course the issue that current->nr_dirtied is
> computed over all BDIs it dirtied pages from, and the sleep time is
> computed for the BDI it happened to do the overflowing write on.
> 
> Assuming an task (mostly) writes to a single bdi, or equally to all, it
> should all work out.

Right. That's one pitfall I forgot to mention, sorry.

If _really_ necessary, the above imperfection can be avoided by adding
tsk->last_dirty_bdi and tsk->to_pause, and to do so when switching to
another bdi:

        to_pause += nr_dirtied / task_ratelimit
        if (to_pause > reasonable_large_pause_time) {
                sleep(to_pause)
                to_pause = 0
        }
        nr_dirtied  = 0

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 14:07           ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:19:32AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > > 
> > > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > > to be overall bdi limit and does not seem to take into account the
> > > number of tasks doing IO to that bdi (as your comment suggests). So
> > > it probably will track write_bw as opposed to write_bw/N. What am
> > > I missing? 
> > 
> > I think the per task thing comes from him using the pages_dirtied
> > argument to balance_dirty_pages() to compute the sleep time. Although
> > I'm not quite sure how he keeps fairness in light of the sleep time
> > bounding to MAX_PAUSE.
> 
> Furthermore, there's of course the issue that current->nr_dirtied is
> computed over all BDIs it dirtied pages from, and the sleep time is
> computed for the BDI it happened to do the overflowing write on.
> 
> Assuming an task (mostly) writes to a single bdi, or equally to all, it
> should all work out.

Right. That's one pitfall I forgot to mention, sorry.

If _really_ necessary, the above imperfection can be avoided by adding
tsk->last_dirty_bdi and tsk->to_pause, and to do so when switching to
another bdi:

        to_pause += nr_dirtied / task_ratelimit
        if (to_pause > reasonable_large_pause_time) {
                sleep(to_pause)
                to_pause = 0
        }
        nr_dirtied  = 0

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:16       ` Peter Zijlstra
@ 2011-08-10 14:00         ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:16:30AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests).
> > So it probably will track write_bw as opposed to write_bw/N. What
> > am I missing? 

In normal situation (near the setpoints),

   task_ratelimit ~= bdi->dirty_ratelimit ~= write_bw / N

Yes, dirty_ratelimit is a per-bdi variable, because all tasks share
roughly the same dirty ratelimit for the obvious reason of fairness.
 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time.

Yeah. Ultimately it will allow different tasks to be throttled at
different (user specified) rates.

> Although I'm not quite sure how he keeps fairness in light of the
> sleep time bounding to MAX_PAUSE.

Firstly, MAX_PAUSE will only be applied when the dirty pages rush
high (dirty exceeded).  Secondly, the dirty exceeded state is global
to all tasks, in which case each task will sleep for MAX_PAUSE equally.
So the fairness is still maintained in dirty exceeded state.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 14:00         ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:16:30AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests).
> > So it probably will track write_bw as opposed to write_bw/N. What
> > am I missing? 

In normal situation (near the setpoints),

   task_ratelimit ~= bdi->dirty_ratelimit ~= write_bw / N

Yes, dirty_ratelimit is a per-bdi variable, because all tasks share
roughly the same dirty ratelimit for the obvious reason of fairness.
 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time.

Yeah. Ultimately it will allow different tasks to be throttled at
different (user specified) rates.

> Although I'm not quite sure how he keeps fairness in light of the
> sleep time bounding to MAX_PAUSE.

Firstly, MAX_PAUSE will only be applied when the dirty pages rush
high (dirty exceeded).  Secondly, the dirty exceeded state is global
to all tasks, in which case each task will sleep for MAX_PAUSE equally.
So the fairness is still maintained in dirty exceeded state.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 14:57     ` Peter Zijlstra
@ 2011-08-10 11:07       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 10:57:32PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> This is (0), since it makes (1). But it fails to explain what the
> difference is between task_ratelimit and pos_bw (and why positional
> bandwidth is a good name).

Yeah it's (0) and is another form of the formula used in
balance_dirty_pages():

        rate = bdi->dirty_ratelimit * pos_ratio

In fact the estimation of ref_bw can take a more general form, by
writing (0) as

        task_ratelimit = task_ratelimit_0

where task_ratelimit_0 is any non-zero value balance_dirty_pages()
uses to throttle the tasks during that 200ms.

> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> 
> Right, so that I get. And our premise for the whole work is to delay
> applications so that we match the dirty_bw to the write_bw, right?

Right, the balance target is (dirty_bw == write_bw),
but let's rename dirty_bw to dirty_rate as you suggested.

> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                         (1)
> 
> So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
> since N is a unit-less number.

Right.

> What does task_ratelimit in pages/s mean? Since we make the tasks sleep
> the only thing we can make from this is a measure of pages. So I expect
> (in a later patch) we compute the sleep time on the amount of pages we
> want written out, using this ratelimit measure, right?

Right. balance_dirty_pages() will use it this way (the variable name
used in code is 'bw', will change to 'rate'):

        pause = (HZ * pages_dirtied) / task_ratelimit

> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw          (2)
> 
> Here you introduce reference bandwidth, what does it mean and what is
> its relation to positional bandwidth. Going by the equation, we got
> (pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

Yeah. Or better do some renames:

          balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)    (2)

> write_bw/dirty_bw is the ration between output and input of dirty pages,
> but what is pos_bw and what does that make ref_bw?

It's (bdi->dirty_ratelimit * pos_ratio), the effective dirty rate
balance_dirty_pages() used to limit each bdi task for the past 200ms.

For example, if (task_ratelimit_0 = write_bw). Then the N dd tasks
will make bdi dirty rate (dirty_rate = N * task_ratelimit_0), and the
balanced ratelimit will be

        balanced_rate
        = task_ratelimit_0 * (write_bw / (N * task_ratelimit_0))
        = write_bw / N

Thus within 200ms, we get the estimation of balanced_rate without
knowing N beforehand.

> > >From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                         (3)
> 
> Somehow this seems like the primary postulate, yet you present it like a
> derivation. The whole purpose of your control system is to provide this
> fairness between processes, therefore I would expect you start out with
> this postulate and reason therefrom.

Good idea.

> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Which does lead to the question why its not called that instead ;-)

Sure, changed to balanced_rate :-)

> > In practice, the ref_bw calculated by (2) may fluctuate and have
> > estimation errors. So the bdi->dirty_ratelimit update policy is to
> > follow it only when both pos_bw and ref_bw point to the same direction
> > (indicating not only the dirty position has deviated from the global/bdi
> > setpoints, but also it's still departing away).
> 
> Which is where you introduce the need for pos_bw, yet you have not yet
> explained its meaning. In this explanation you allude to it being the
> speed (first time derivative) of the deviation from the setpoint.

That's right.

> The set point's measure is in pages, so the measure of its first time
> derivative would indeed be pages/s, just like bandwidth, but calling it
> a bandwidth seems highly confusing indeed.

Yeah, I'll rename the relevant vars *bw to *rate.

> I would also like a few more words on your update condition, why did you
> pick those, and what are the full ramifications of them.

OK.

> Also missing in this story is your pos_ratio thing, it is used in the
> code, but there is no explanation on how it ties in with the above
> things.

There are two control targets

(1) dirty setpoint
(2) dirty rate

pos_ratio does the position based control for (1). It's not inherently
relevant to the computation of balanced_rate. I hope the below rephrased
text will make it easier to understand.

: When started N dd, we would like to throttle each dd at
: 
:          balanced_rate == write_bw / N                                  (1)
: 
: We don't know N beforehand, but still can estimate balanced_rate
: within 200ms.
: 
: Start by throttling each dd task at rate
: 
:         task_ratelimit = task_ratelimit_0                               (2)
:                          (any non-zero initial value is OK)
: 
: After 200ms, we got
: 
:         dirty_rate = # of pages dirtied by all dd's / 200ms
:         write_bw   = # of pages written to the disk / 200ms
: 
: For the aggressive dd dirtiers, the equality holds
: 
:         dirty_rate == N * task_rate
:                    == N * task_ratelimit
:                    == N * task_ratelimit_0                              (3)
: Or
:         task_ratelimit_0 = dirty_rate / N                               (4)
:                           
: So the balanced throttle bandwidth can be estimated by
:                           
:         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
:                           
: Because with (4) and (5) we can get the desired equality (1):
:                           
:         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
:                       == write_bw / N
:
: Since balance_dirty_pages() will be using
:        
:         task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()    (6)
: 
:        
: Taking (5) and (6), we get the real formula used in the code
:                                                                  
:         balanced_rate = bdi->dirty_ratelimit * bdi_position_ratio() * 
:                                 (write_bw / dirty_rate)                 (7)
: 

> You seem very skilled in control systems (your earlier read-ahead work
> was also a very complex system),

Thank you! I majored in the college "Pattern Recognition and Intelligent
Systems" and "Control theory and Control Engineering", which happen to be
the perfect preparations for read-ahead and dirty balancing :)

> but the explanations of your systems are highly confusing.

Sorry for that!

> Can you go back to the roots and explain how you constructed your
> model and why you did so? (without using graphs please)

As mentioned above, the root requirements are

(1) position target: to keep dirty pages around the bdi/global setpoints
(2) rate target:     to keep bdi dirty rate around bdi write bandwidth

In order to meet (2), we try to estimate (balanced_rate = write_bw / N)
and use it to throttle the N dd tasks.

However that's not enough. When the dirty rate perfectly matches the
write bandwidth, the dirty pages can stay stationary at any point.  We
want the dirty pages to stay around the setpoints as required by (1).

So if the dirty pages are ABOVE the setpoints, we throttle each task
a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse). With that positional adjustment, the formula is
transformed from

        task_ratelimit = balanced_rate              => meets (2)

to

        task_ratelimit = balanced_rate * pos_ratio  => meets both (1),(2)

At last, due to the possible large fluctuations in the raw
balanced_rate value, the more stable bdi->dirty_ratelimit which tracks
balanced_rate in a conservative way is used, resulting in the final form

        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()

> PS. I'm not criticizing your work, the results are impressive (as
> always), but I find it very hard to understand. 
> 
> PPS. If it would help, feel free to refer me to educational material on
> control system theory, either online or in books.

Fortunately no fancy control theory is used here ;) Only the simple
theory of negative feedback control is used, which states that there
will be overshoots and ringing if trying to correct the errors way too
fast.

The overshooting concept can be explained in the graph of the below page,
where the step response can be a sudden start of dd reader that took
away all the disk write bandwidth.

http://en.wikipedia.org/wiki/Step_response

In terms of the negative feedback control theory, the
bdi_position_ratio() function (control lines) can be expressed as

1) f(setpoint) = 1.0
2) df/dt < 0

3) optionally, abs(df/dt) should be large on large errors (= dirty -
   setpoint) in order to cancel the errors fast, and be smaller when
   dirty pages get closer to the setpoints in order to avoid overshooting.

The principle of (3) will be implemented in some follow up patches :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-10 11:07       ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-10 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 10:57:32PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> This is (0), since it makes (1). But it fails to explain what the
> difference is between task_ratelimit and pos_bw (and why positional
> bandwidth is a good name).

Yeah it's (0) and is another form of the formula used in
balance_dirty_pages():

        rate = bdi->dirty_ratelimit * pos_ratio

In fact the estimation of ref_bw can take a more general form, by
writing (0) as

        task_ratelimit = task_ratelimit_0

where task_ratelimit_0 is any non-zero value balance_dirty_pages()
uses to throttle the tasks during that 200ms.

> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> 
> Right, so that I get. And our premise for the whole work is to delay
> applications so that we match the dirty_bw to the write_bw, right?

Right, the balance target is (dirty_bw == write_bw),
but let's rename dirty_bw to dirty_rate as you suggested.

> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                         (1)
> 
> So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
> since N is a unit-less number.

Right.

> What does task_ratelimit in pages/s mean? Since we make the tasks sleep
> the only thing we can make from this is a measure of pages. So I expect
> (in a later patch) we compute the sleep time on the amount of pages we
> want written out, using this ratelimit measure, right?

Right. balance_dirty_pages() will use it this way (the variable name
used in code is 'bw', will change to 'rate'):

        pause = (HZ * pages_dirtied) / task_ratelimit

> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw          (2)
> 
> Here you introduce reference bandwidth, what does it mean and what is
> its relation to positional bandwidth. Going by the equation, we got
> (pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

Yeah. Or better do some renames:

          balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)    (2)

> write_bw/dirty_bw is the ration between output and input of dirty pages,
> but what is pos_bw and what does that make ref_bw?

It's (bdi->dirty_ratelimit * pos_ratio), the effective dirty rate
balance_dirty_pages() used to limit each bdi task for the past 200ms.

For example, if (task_ratelimit_0 = write_bw). Then the N dd tasks
will make bdi dirty rate (dirty_rate = N * task_ratelimit_0), and the
balanced ratelimit will be

        balanced_rate
        = task_ratelimit_0 * (write_bw / (N * task_ratelimit_0))
        = write_bw / N

Thus within 200ms, we get the estimation of balanced_rate without
knowing N beforehand.

> > >From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                         (3)
> 
> Somehow this seems like the primary postulate, yet you present it like a
> derivation. The whole purpose of your control system is to provide this
> fairness between processes, therefore I would expect you start out with
> this postulate and reason therefrom.

Good idea.

> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Which does lead to the question why its not called that instead ;-)

Sure, changed to balanced_rate :-)

> > In practice, the ref_bw calculated by (2) may fluctuate and have
> > estimation errors. So the bdi->dirty_ratelimit update policy is to
> > follow it only when both pos_bw and ref_bw point to the same direction
> > (indicating not only the dirty position has deviated from the global/bdi
> > setpoints, but also it's still departing away).
> 
> Which is where you introduce the need for pos_bw, yet you have not yet
> explained its meaning. In this explanation you allude to it being the
> speed (first time derivative) of the deviation from the setpoint.

That's right.

> The set point's measure is in pages, so the measure of its first time
> derivative would indeed be pages/s, just like bandwidth, but calling it
> a bandwidth seems highly confusing indeed.

Yeah, I'll rename the relevant vars *bw to *rate.

> I would also like a few more words on your update condition, why did you
> pick those, and what are the full ramifications of them.

OK.

> Also missing in this story is your pos_ratio thing, it is used in the
> code, but there is no explanation on how it ties in with the above
> things.

There are two control targets

(1) dirty setpoint
(2) dirty rate

pos_ratio does the position based control for (1). It's not inherently
relevant to the computation of balanced_rate. I hope the below rephrased
text will make it easier to understand.

: When started N dd, we would like to throttle each dd at
: 
:          balanced_rate == write_bw / N                                  (1)
: 
: We don't know N beforehand, but still can estimate balanced_rate
: within 200ms.
: 
: Start by throttling each dd task at rate
: 
:         task_ratelimit = task_ratelimit_0                               (2)
:                          (any non-zero initial value is OK)
: 
: After 200ms, we got
: 
:         dirty_rate = # of pages dirtied by all dd's / 200ms
:         write_bw   = # of pages written to the disk / 200ms
: 
: For the aggressive dd dirtiers, the equality holds
: 
:         dirty_rate == N * task_rate
:                    == N * task_ratelimit
:                    == N * task_ratelimit_0                              (3)
: Or
:         task_ratelimit_0 = dirty_rate / N                               (4)
:                           
: So the balanced throttle bandwidth can be estimated by
:                           
:         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
:                           
: Because with (4) and (5) we can get the desired equality (1):
:                           
:         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
:                       == write_bw / N
:
: Since balance_dirty_pages() will be using
:        
:         task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()    (6)
: 
:        
: Taking (5) and (6), we get the real formula used in the code
:                                                                  
:         balanced_rate = bdi->dirty_ratelimit * bdi_position_ratio() * 
:                                 (write_bw / dirty_rate)                 (7)
: 

> You seem very skilled in control systems (your earlier read-ahead work
> was also a very complex system),

Thank you! I majored in the college "Pattern Recognition and Intelligent
Systems" and "Control theory and Control Engineering", which happen to be
the perfect preparations for read-ahead and dirty balancing :)

> but the explanations of your systems are highly confusing.

Sorry for that!

> Can you go back to the roots and explain how you constructed your
> model and why you did so? (without using graphs please)

As mentioned above, the root requirements are

(1) position target: to keep dirty pages around the bdi/global setpoints
(2) rate target:     to keep bdi dirty rate around bdi write bandwidth

In order to meet (2), we try to estimate (balanced_rate = write_bw / N)
and use it to throttle the N dd tasks.

However that's not enough. When the dirty rate perfectly matches the
write bandwidth, the dirty pages can stay stationary at any point.  We
want the dirty pages to stay around the setpoints as required by (1).

So if the dirty pages are ABOVE the setpoints, we throttle each task
a bit more HEAVY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse). With that positional adjustment, the formula is
transformed from

        task_ratelimit = balanced_rate              => meets (2)

to

        task_ratelimit = balanced_rate * pos_ratio  => meets both (1),(2)

At last, due to the possible large fluctuations in the raw
balanced_rate value, the more stable bdi->dirty_ratelimit which tracks
balanced_rate in a conservative way is used, resulting in the final form

        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()

> PS. I'm not criticizing your work, the results are impressive (as
> always), but I find it very hard to understand. 
> 
> PPS. If it would help, feel free to refer me to educational material on
> control system theory, either online or in books.

Fortunately no fancy control theory is used here ;) Only the simple
theory of negative feedback control is used, which states that there
will be overshoots and ringing if trying to correct the errors way too
fast.

The overshooting concept can be explained in the graph of the below page,
where the step response can be a sudden start of dd reader that took
away all the disk write bandwidth.

http://en.wikipedia.org/wiki/Step_response

In terms of the negative feedback control theory, the
bdi_position_ratio() function (control lines) can be expressed as

1) f(setpoint) = 1.0
2) df/dt < 0

3) optionally, abs(df/dt) should be large on large errors (= dirty -
   setpoint) in order to cancel the errors fast, and be smaller when
   dirty pages get closer to the setpoints in order to avoid overshooting.

The principle of (3) will be implemented in some follow up patches :)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-09 17:02     ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:

> +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +

> +       pos_ratio *= bdi->avg_write_bandwidth;
> +       do_div(pos_ratio, dirty_bw | 1);
> +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 

when written out that results in:

           bw * pos_ratio * bdi->avg_write_bandwidth
  ref_bw = -----------------------------------------
                         dirty_bw

which would suggest you write it like:

  ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);

since pos_bw is already bw * pos_ratio per the above.

Or am I missing something?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 17:02     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:

> +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +

> +       pos_ratio *= bdi->avg_write_bandwidth;
> +       do_div(pos_ratio, dirty_bw | 1);
> +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 

when written out that results in:

           bw * pos_ratio * bdi->avg_write_bandwidth
  ref_bw = -----------------------------------------
                         dirty_bw

which would suggest you write it like:

  ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);

since pos_bw is already bw * pos_ratio per the above.

Or am I missing something?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 17:02     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:

> +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +

> +       pos_ratio *= bdi->avg_write_bandwidth;
> +       do_div(pos_ratio, dirty_bw | 1);
> +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 

when written out that results in:

           bw * pos_ratio * bdi->avg_write_bandwidth
  ref_bw = -----------------------------------------
                         dirty_bw

which would suggest you write it like:

  ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);

since pos_bw is already bw * pos_ratio per the above.

Or am I missing something?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-09 16:56     ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;

I can't actually find this low-pass filter in the code.. could be I'm
blind from staring at it too long though..

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:56     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;

I can't actually find this low-pass filter in the code.. could be I'm
blind from staring at it too long though..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:56     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;

I can't actually find this low-pass filter in the code.. could be I'm
blind from staring at it too long though..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:16       ` Peter Zijlstra
  (?)
@ 2011-08-09 16:19         ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests). So
> > it probably will track write_bw as opposed to write_bw/N. What am
> > I missing? 
> 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time. Although
> I'm not quite sure how he keeps fairness in light of the sleep time
> bounding to MAX_PAUSE.

Furthermore, there's of course the issue that current->nr_dirtied is
computed over all BDIs it dirtied pages from, and the sleep time is
computed for the BDI it happened to do the overflowing write on.

Assuming an task (mostly) writes to a single bdi, or equally to all, it
should all work out.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:19         ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests). So
> > it probably will track write_bw as opposed to write_bw/N. What am
> > I missing? 
> 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time. Although
> I'm not quite sure how he keeps fairness in light of the sleep time
> bounding to MAX_PAUSE.

Furthermore, there's of course the issue that current->nr_dirtied is
computed over all BDIs it dirtied pages from, and the sleep time is
computed for the BDI it happened to do the overflowing write on.

Assuming an task (mostly) writes to a single bdi, or equally to all, it
should all work out.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:19         ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> > limit (based on postion ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests). So
> > it probably will track write_bw as opposed to write_bw/N. What am
> > I missing? 
> 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time. Although
> I'm not quite sure how he keeps fairness in light of the sleep time
> bounding to MAX_PAUSE.

Furthermore, there's of course the issue that current->nr_dirtied is
computed over all BDIs it dirtied pages from, and the sleep time is
computed for the BDI it happened to do the overflowing write on.

Assuming an task (mostly) writes to a single bdi, or equally to all, it
should all work out.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 15:50     ` Vivek Goyal
  (?)
@ 2011-08-09 16:16       ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> 
> So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> limit (based on postion ratio, dirty_bw and write_bw). But this seems
> to be overall bdi limit and does not seem to take into account the
> number of tasks doing IO to that bdi (as your comment suggests). So
> it probably will track write_bw as opposed to write_bw/N. What am
> I missing? 

I think the per task thing comes from him using the pages_dirtied
argument to balance_dirty_pages() to compute the sleep time. Although
I'm not quite sure how he keeps fairness in light of the sleep time
bounding to MAX_PAUSE.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:16       ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> 
> So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> limit (based on postion ratio, dirty_bw and write_bw). But this seems
> to be overall bdi limit and does not seem to take into account the
> number of tasks doing IO to that bdi (as your comment suggests). So
> it probably will track write_bw as opposed to write_bw/N. What am
> I missing? 

I think the per task thing comes from him using the pages_dirtied
argument to balance_dirty_pages() to compute the sleep time. Although
I'm not quite sure how he keeps fairness in light of the sleep time
bounding to MAX_PAUSE.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 16:16       ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> 
> So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
> limit (based on postion ratio, dirty_bw and write_bw). But this seems
> to be overall bdi limit and does not seem to take into account the
> number of tasks doing IO to that bdi (as your comment suggests). So
> it probably will track write_bw as opposed to write_bw/N. What am
> I missing? 

I think the per task thing comes from him using the pages_dirtied
argument to balance_dirty_pages() to compute the sleep time. Although
I'm not quite sure how he keeps fairness in light of the sleep time
bounding to MAX_PAUSE.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 15:50     ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-09 15:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
> + *
> + * Normal bdi tasks will be curbed at or below it in long term.
> + * Obviously it should be around (write_bw / N) when there are N dd tasks.
> + */

Hi Fengguang,

So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
limit (based on postion ratio, dirty_bw and write_bw). But this seems
to be overall bdi limit and does not seem to take into account the
number of tasks doing IO to that bdi (as your comment suggests). So
it probably will track write_bw as opposed to write_bw/N. What am
I missing?

Thanks
Vivek


> +static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> +				       unsigned long thresh,
> +				       unsigned long dirty,
> +				       unsigned long bdi_thresh,
> +				       unsigned long bdi_dirty,
> +				       unsigned long dirtied,
> +				       unsigned long elapsed)
> +{
> +	unsigned long bw = bdi->dirty_ratelimit;
> +	unsigned long dirty_bw;
> +	unsigned long pos_bw;
> +	unsigned long ref_bw;
> +	unsigned long long pos_ratio;
> +
> +	/*
> +	 * The dirty rate will match the writeback rate in long term, except
> +	 * when dirty pages are truncated by userspace or re-dirtied by FS.
> +	 */
> +	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +
> +	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
> +				       bdi_thresh, bdi_dirty);
> +	/*
> +	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
> +	 */
> +	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +
> +	/*
> +	 * ref_bw = pos_bw * write_bw / dirty_bw
> +	 *
> +	 * It's a linear estimation of the "balanced" throttle bandwidth.
> +	 */
> +	pos_ratio *= bdi->avg_write_bandwidth;
> +	do_div(pos_ratio, dirty_bw | 1);
> +	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +
> +	/*
> +	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
> +	 * are on the same side of dirty_ratelimit. Which not only makes it
> +	 * more stable, but also is essential for preventing it being driven
> +	 * away by possible systematic errors in ref_bw.
> +	 */
> +	if (pos_bw < bw) {
> +		if (ref_bw < bw)
> +			bw = max(ref_bw, pos_bw);
> +	} else {
> +		if (ref_bw > bw)
> +			bw = min(ref_bw, pos_bw);
> +	}
> +
> +	bdi->dirty_ratelimit = bw;
> +}
> +
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
>  			    unsigned long dirty,
> @@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
>  {
>  	unsigned long now = jiffies;
>  	unsigned long elapsed = now - bdi->bw_time_stamp;
> +	unsigned long dirtied;
>  	unsigned long written;
>  
>  	/*
> @@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed < BANDWIDTH_INTERVAL)
>  		return;
>  
> +	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>  	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>  
>  	/*
> @@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> -	if (thresh)
> +	if (thresh) {
>  		global_update_bandwidth(thresh, dirty, now);
> -
> +		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
> +					   bdi_dirty, dirtied, elapsed);
> +	}
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>  
>  snapshot:
> +	bdi->dirtied_stamp = dirtied;
>  	bdi->written_stamp = written;
>  	bdi->bw_time_stamp = now;
>  }
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 15:50     ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-09 15:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
> + *
> + * Normal bdi tasks will be curbed at or below it in long term.
> + * Obviously it should be around (write_bw / N) when there are N dd tasks.
> + */

Hi Fengguang,

So IIUC, bdi->dirty_ratelimit is the dynmically adjusted desired rate
limit (based on postion ratio, dirty_bw and write_bw). But this seems
to be overall bdi limit and does not seem to take into account the
number of tasks doing IO to that bdi (as your comment suggests). So
it probably will track write_bw as opposed to write_bw/N. What am
I missing?

Thanks
Vivek


> +static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> +				       unsigned long thresh,
> +				       unsigned long dirty,
> +				       unsigned long bdi_thresh,
> +				       unsigned long bdi_dirty,
> +				       unsigned long dirtied,
> +				       unsigned long elapsed)
> +{
> +	unsigned long bw = bdi->dirty_ratelimit;
> +	unsigned long dirty_bw;
> +	unsigned long pos_bw;
> +	unsigned long ref_bw;
> +	unsigned long long pos_ratio;
> +
> +	/*
> +	 * The dirty rate will match the writeback rate in long term, except
> +	 * when dirty pages are truncated by userspace or re-dirtied by FS.
> +	 */
> +	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +
> +	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
> +				       bdi_thresh, bdi_dirty);
> +	/*
> +	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
> +	 */
> +	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +
> +	/*
> +	 * ref_bw = pos_bw * write_bw / dirty_bw
> +	 *
> +	 * It's a linear estimation of the "balanced" throttle bandwidth.
> +	 */
> +	pos_ratio *= bdi->avg_write_bandwidth;
> +	do_div(pos_ratio, dirty_bw | 1);
> +	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +
> +	/*
> +	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
> +	 * are on the same side of dirty_ratelimit. Which not only makes it
> +	 * more stable, but also is essential for preventing it being driven
> +	 * away by possible systematic errors in ref_bw.
> +	 */
> +	if (pos_bw < bw) {
> +		if (ref_bw < bw)
> +			bw = max(ref_bw, pos_bw);
> +	} else {
> +		if (ref_bw > bw)
> +			bw = min(ref_bw, pos_bw);
> +	}
> +
> +	bdi->dirty_ratelimit = bw;
> +}
> +
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
>  			    unsigned long dirty,
> @@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
>  {
>  	unsigned long now = jiffies;
>  	unsigned long elapsed = now - bdi->bw_time_stamp;
> +	unsigned long dirtied;
>  	unsigned long written;
>  
>  	/*
> @@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed < BANDWIDTH_INTERVAL)
>  		return;
>  
> +	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>  	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>  
>  	/*
> @@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> -	if (thresh)
> +	if (thresh) {
>  		global_update_bandwidth(thresh, dirty, now);
> -
> +		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
> +					   bdi_dirty, dirtied, elapsed);
> +	}
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>  
>  snapshot:
> +	bdi->dirtied_stamp = dirtied;
>  	bdi->written_stamp = written;
>  	bdi->bw_time_stamp = now;
>  }
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-09 14:57     ` Peter Zijlstra
  -1 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 14:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)

This is (0), since it makes (1). But it fails to explain what the
difference is between task_ratelimit and pos_bw (and why positional
bandwidth is a good name).

> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms

Right, so that I get. And our premise for the whole work is to delay
applications so that we match the dirty_bw to the write_bw, right?

> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                         (1)

So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
since N is a unit-less number.

What does task_ratelimit in pages/s mean? Since we make the tasks sleep
the only thing we can make from this is a measure of pages. So I expect
(in a later patch) we compute the sleep time on the amount of pages we
want written out, using this ratelimit measure, right?

> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw          (2)

Here you introduce reference bandwidth, what does it mean and what is
its relation to positional bandwidth. Going by the equation, we got
(pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

write_bw/dirty_bw is the ration between output and input of dirty pages,
but what is pos_bw and what does that make ref_bw?

> >From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                         (3)

Somehow this seems like the primary postulate, yet you present it like a
derivation. The whole purpose of your control system is to provide this
fairness between processes, therefore I would expect you start out with
this postulate and reason therefrom.

> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Which does lead to the question why its not called that instead ;-)

> In practice, the ref_bw calculated by (2) may fluctuate and have
> estimation errors. So the bdi->dirty_ratelimit update policy is to
> follow it only when both pos_bw and ref_bw point to the same direction
> (indicating not only the dirty position has deviated from the global/bdi
> setpoints, but also it's still departing away).

Which is where you introduce the need for pos_bw, yet you have not yet
explained its meaning. In this explanation you allude to it being the
speed (first time derivative) of the deviation from the setpoint.

The set point's measure is in pages, so the measure of its first time
derivative would indeed be pages/s, just like bandwidth, but calling it
a bandwidth seems highly confusing indeed.

I would also like a few more words on your update condition, why did you
pick those, and what are the full ramifications of them.

Also missing in this story is your pos_ratio thing, it is used in the
code, but there is no explanation on how it ties in with the above
things.


You seem very skilled in control systems (your earlier read-ahead work
was also a very complex system), but the explanations of your systems
are highly confusing. Can you go back to the roots and explain how you
constructed your model and why you did so? (without using graphs please)


PS. I'm not criticizing your work, the results are impressive (as
always), but I find it very hard to understand. 

PPS. If it would help, feel free to refer me to educational material on
control system theory, either online or in books.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 14:57     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 14:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)

This is (0), since it makes (1). But it fails to explain what the
difference is between task_ratelimit and pos_bw (and why positional
bandwidth is a good name).

> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms

Right, so that I get. And our premise for the whole work is to delay
applications so that we match the dirty_bw to the write_bw, right?

> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                         (1)

So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
since N is a unit-less number.

What does task_ratelimit in pages/s mean? Since we make the tasks sleep
the only thing we can make from this is a measure of pages. So I expect
(in a later patch) we compute the sleep time on the amount of pages we
want written out, using this ratelimit measure, right?

> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw          (2)

Here you introduce reference bandwidth, what does it mean and what is
its relation to positional bandwidth. Going by the equation, we got
(pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

write_bw/dirty_bw is the ration between output and input of dirty pages,
but what is pos_bw and what does that make ref_bw?

> >From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                         (3)

Somehow this seems like the primary postulate, yet you present it like a
derivation. The whole purpose of your control system is to provide this
fairness between processes, therefore I would expect you start out with
this postulate and reason therefrom.

> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Which does lead to the question why its not called that instead ;-)

> In practice, the ref_bw calculated by (2) may fluctuate and have
> estimation errors. So the bdi->dirty_ratelimit update policy is to
> follow it only when both pos_bw and ref_bw point to the same direction
> (indicating not only the dirty position has deviated from the global/bdi
> setpoints, but also it's still departing away).

Which is where you introduce the need for pos_bw, yet you have not yet
explained its meaning. In this explanation you allude to it being the
speed (first time derivative) of the deviation from the setpoint.

The set point's measure is in pages, so the measure of its first time
derivative would indeed be pages/s, just like bandwidth, but calling it
a bandwidth seems highly confusing indeed.

I would also like a few more words on your update condition, why did you
pick those, and what are the full ramifications of them.

Also missing in this story is your pos_ratio thing, it is used in the
code, but there is no explanation on how it ties in with the above
things.


You seem very skilled in control systems (your earlier read-ahead work
was also a very complex system), but the explanations of your systems
are highly confusing. Can you go back to the roots and explain how you
constructed your model and why you did so? (without using graphs please)


PS. I'm not criticizing your work, the results are impressive (as
always), but I find it very hard to understand. 

PPS. If it would help, feel free to refer me to educational material on
control system theory, either online or in books.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 14:57     ` Peter Zijlstra
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Zijlstra @ 2011-08-09 14:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)

This is (0), since it makes (1). But it fails to explain what the
difference is between task_ratelimit and pos_bw (and why positional
bandwidth is a good name).

> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms

Right, so that I get. And our premise for the whole work is to delay
applications so that we match the dirty_bw to the write_bw, right?

> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                         (1)

So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
since N is a unit-less number.

What does task_ratelimit in pages/s mean? Since we make the tasks sleep
the only thing we can make from this is a measure of pages. So I expect
(in a later patch) we compute the sleep time on the amount of pages we
want written out, using this ratelimit measure, right?

> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw          (2)

Here you introduce reference bandwidth, what does it mean and what is
its relation to positional bandwidth. Going by the equation, we got
(pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

write_bw/dirty_bw is the ration between output and input of dirty pages,
but what is pos_bw and what does that make ref_bw?

> >From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                         (3)

Somehow this seems like the primary postulate, yet you present it like a
derivation. The whole purpose of your control system is to provide this
fairness between processes, therefore I would expect you start out with
this postulate and reason therefrom.

> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Which does lead to the question why its not called that instead ;-)

> In practice, the ref_bw calculated by (2) may fluctuate and have
> estimation errors. So the bdi->dirty_ratelimit update policy is to
> follow it only when both pos_bw and ref_bw point to the same direction
> (indicating not only the dirty position has deviated from the global/bdi
> setpoints, but also it's still departing away).

Which is where you introduce the need for pos_bw, yet you have not yet
explained its meaning. In this explanation you allude to it being the
speed (first time derivative) of the deviation from the setpoint.

The set point's measure is in pages, so the measure of its first time
derivative would indeed be pages/s, just like bandwidth, but calling it
a bandwidth seems highly confusing indeed.

I would also like a few more words on your update condition, why did you
pick those, and what are the full ramifications of them.

Also missing in this story is your pos_ratio thing, it is used in the
code, but there is no explanation on how it ties in with the above
things.


You seem very skilled in control systems (your earlier read-ahead work
was also a very complex system), but the explanations of your systems
are highly confusing. Can you go back to the roots and explain how you
constructed your model and why you did so? (without using graphs please)


PS. I'm not criticizing your work, the results are impressive (as
always), but I find it very hard to understand. 

PPS. If it would help, feel free to refer me to educational material on
control system theory, either online or in books.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 14:54     ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-09 14:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> when there are N dd tasks.
> 
> On write() syscall, use bdi->dirty_ratelimit
> ============================================
> 
>     balance_dirty_pages(pages_dirtied)
>     {
>         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
>         pause = pages_dirtied / pos_bw;
>         sleep(pause);
>     }
> 
> On every 200ms, update bdi->dirty_ratelimit
> ===========================================
> 
>     bdi_update_dirty_ratelimit()
>     {
>         bw = bdi->dirty_ratelimit;
>         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
>         if (dirty pages unbalanced)
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
>     }
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms
> 
> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                      	(1)
> 
> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> 
> >From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                      	(3)
> 
> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Hi Fengguang,

So how much work it is to extend all this to handle the case of cgroups?
IOW, I would imagine that you shall have to keep track of per cgroup/per
bdi state of many of the variables. For example, write_bw will become
per cgroup/per bdi entity instead of per bdi entity only. Same should
be true for position ratio, dirty_bw etc?

I am assuming that if some cgroup is low weight on end device, then
WRITE bandwidth of that cgroup should go down and that should be
accounted for at per bdi state and task throttling should happen
accordingly so that a lower weight cgroup tasks get throttled more
as compared to higher weight cgroup tasks?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-09 14:54     ` Vivek Goyal
  0 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2011-08-09 14:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> when there are N dd tasks.
> 
> On write() syscall, use bdi->dirty_ratelimit
> ============================================
> 
>     balance_dirty_pages(pages_dirtied)
>     {
>         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
>         pause = pages_dirtied / pos_bw;
>         sleep(pause);
>     }
> 
> On every 200ms, update bdi->dirty_ratelimit
> ===========================================
> 
>     bdi_update_dirty_ratelimit()
>     {
>         bw = bdi->dirty_ratelimit;
>         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
>         if (dirty pages unbalanced)
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
>     }
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms
> 
> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                      	(1)
> 
> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> 
> >From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                      	(3)
> 
> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Hi Fengguang,

So how much work it is to extend all this to handle the case of cgroups?
IOW, I would imagine that you shall have to keep track of per cgroup/per
bdi state of many of the variables. For example, write_bw will become
per cgroup/per bdi entity instead of per bdi entity only. Same should
be true for position ratio, dirty_bw etc?

I am assuming that if some cgroup is low weight on end device, then
WRITE bandwidth of that cgroup should go down and that should be
accounted for at per bdi state and task throttling should happen
accordingly so that a lower weight cgroup tasks get throttled more
as compared to higher weight cgroup tasks?

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 6415 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When started N dd, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we got

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

>From (1) and (2), we get equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.

In practice, the ref_bw calculated by (2) may fluctuate and have
estimation errors. So the bdi->dirty_ratelimit update policy is to
follow it only when both pos_bw and ref_bw point to the same direction
(indicating not only the dirty position has deviated from the global/bdi
setpoints, but also it's still departing away).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit. Which not only makes it
+	 * more stable, but also is essential for preventing it being driven
+	 * away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 6718 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When started N dd, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we got

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

>From (1) and (2), we get equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.

In practice, the ref_bw calculated by (2) may fluctuate and have
estimation errors. So the bdi->dirty_ratelimit update policy is to
follow it only when both pos_bw and ref_bw point to the same direction
(indicating not only the dirty position has deviated from the global/bdi
setpoints, but also it's still departing away).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit. Which not only makes it
+	 * more stable, but also is essential for preventing it being driven
+	 * away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 6718 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When started N dd, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we got

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

>From (1) and (2), we get equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.

In practice, the ref_bw calculated by (2) may fluctuate and have
estimation errors. So the bdi->dirty_ratelimit update policy is to
follow it only when both pos_bw and ref_bw point to the same direction
(indicating not only the dirty position has deviated from the global/bdi
setpoints, but also it's still departing away).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit. Which not only makes it
+	 * more stable, but also is essential for preventing it being driven
+	 * away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2011-08-24  3:16 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
2011-08-16  2:20 ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16 19:41   ` Jan Kara
2011-08-16 19:41     ` Jan Kara
2011-08-17 13:23     ` Wu Fengguang
2011-08-17 13:49       ` Wu Fengguang
2011-08-17 13:49         ` Wu Fengguang
2011-08-17 20:24       ` Jan Kara
2011-08-17 20:24         ` Jan Kara
2011-08-18  4:18         ` Wu Fengguang
2011-08-18  4:18           ` Wu Fengguang
2011-08-18  4:41           ` Wu Fengguang
2011-08-18  4:41             ` Wu Fengguang
2011-08-18 19:16           ` Jan Kara
2011-08-18 19:16             ` Jan Kara
2011-08-24  3:16         ` Wu Fengguang
2011-08-24  3:16           ` Wu Fengguang
2011-08-19  2:53   ` Vivek Goyal
2011-08-19  2:53     ` Vivek Goyal
2011-08-19  3:25     ` Wu Fengguang
2011-08-19  3:25       ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 3/5] writeback: dirty rate control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 4/5] writeback: per task dirty rate limit Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  7:17   ` Andrea Righi
2011-08-16  7:17     ` Andrea Righi
2011-08-16  7:22     ` Wu Fengguang
2011-08-16  7:22       ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 5/5] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-19  2:06   ` Vivek Goyal
2011-08-19  2:06     ` Vivek Goyal
2011-08-19  2:54     ` Wu Fengguang
2011-08-19  2:54       ` Wu Fengguang
2011-08-19 19:00       ` Vivek Goyal
2011-08-19 19:00         ` Vivek Goyal
2011-08-21  3:46         ` Wu Fengguang
2011-08-21  3:46           ` Wu Fengguang
2011-08-22 17:22           ` Vivek Goyal
2011-08-22 17:22             ` Vivek Goyal
2011-08-23  1:07             ` Wu Fengguang
2011-08-23  1:07               ` Wu Fengguang
2011-08-23  3:53               ` Wu Fengguang
2011-08-23  3:53                 ` Wu Fengguang
2011-08-23 13:53               ` Vivek Goyal
2011-08-23 13:53                 ` Vivek Goyal
2011-08-24  3:09                 ` Wu Fengguang
2011-08-24  3:09                   ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
2011-08-06  8:44 ` [PATCH 3/5] writeback: dirty rate control Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-09 14:54   ` Vivek Goyal
2011-08-09 14:54     ` Vivek Goyal
2011-08-11  3:42     ` Wu Fengguang
2011-08-11  3:42       ` Wu Fengguang
2011-08-09 14:57   ` Peter Zijlstra
2011-08-09 14:57     ` Peter Zijlstra
2011-08-09 14:57     ` Peter Zijlstra
2011-08-10 11:07     ` Wu Fengguang
2011-08-10 11:07       ` Wu Fengguang
2011-08-10 16:17       ` Peter Zijlstra
2011-08-10 16:17         ` Peter Zijlstra
2011-08-10 16:17         ` Peter Zijlstra
2011-08-15 14:08         ` Wu Fengguang
2011-08-15 14:08           ` Wu Fengguang
2011-08-09 15:50   ` Vivek Goyal
2011-08-09 15:50     ` Vivek Goyal
2011-08-09 16:16     ` Peter Zijlstra
2011-08-09 16:16       ` Peter Zijlstra
2011-08-09 16:16       ` Peter Zijlstra
2011-08-09 16:19       ` Peter Zijlstra
2011-08-09 16:19         ` Peter Zijlstra
2011-08-09 16:19         ` Peter Zijlstra
2011-08-10 14:07         ` Wu Fengguang
2011-08-10 14:07           ` Wu Fengguang
2011-08-10 14:00       ` Wu Fengguang
2011-08-10 14:00         ` Wu Fengguang
2011-08-10 17:10         ` Peter Zijlstra
2011-08-10 17:10           ` Peter Zijlstra
2011-08-15 14:11           ` Wu Fengguang
2011-08-15 14:11             ` Wu Fengguang
2011-08-09 16:56   ` Peter Zijlstra
2011-08-09 16:56     ` Peter Zijlstra
2011-08-09 16:56     ` Peter Zijlstra
2011-08-10 14:10     ` Wu Fengguang
2011-08-09 17:02   ` Peter Zijlstra
2011-08-09 17:02     ` Peter Zijlstra
2011-08-09 17:02     ` Peter Zijlstra
2011-08-10 14:15     ` Wu Fengguang
2011-08-10 14:15       ` Wu Fengguang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.