* [PATCH 0/5] IO-less dirty throttling v9
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi,

This series contains the core bits of the IO-less balance_dirty_pages().

        git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v9

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing update step size on small errors
- limit per-CPU dirtied pages so that dirty pages cannot run away with 1k+ tasks (Peter)

Thanks a lot to Peter, Andrea and Vivek for the careful reviews!

shortlog:
        
        Wu Fengguang (5):
              writeback: account per-bdi accumulated dirtied pages
              writeback: dirty position control
              writeback: dirty rate control
              writeback: per task dirty rate limit
              writeback: IO-less balance_dirty_pages()

        The last 4 patches are one single logical change, but are split here to
        make it easier to review the different parts of the algorithm.

diffstat:

	 fs/fs-writeback.c                |    2 
	 include/linux/backing-dev.h      |    8 
	 include/linux/sched.h            |    7 
	 include/linux/writeback.h        |    1 
	 include/trace/events/writeback.h |   24 -
	 kernel/fork.c                    |    3 
	 mm/backing-dev.c                 |    3 
	 mm/page-writeback.c              |  544 ++++++++++++++++++++---------
	 8 files changed, 414 insertions(+), 178 deletions(-)

Thanks,
Fengguang



* [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Michael Rubin, Wu Fengguang,
	Andrew Morton, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2019 bytes --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
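
As a rough illustration of how a cumulative counter such as BDI_DIRTIED
could feed a rate estimate, the userspace sketch below takes two snapshots
of a counter and divides the delta by the elapsed time. The struct and
function names are hypothetical and not part of this patch; the estimator
actually used by the series may differ.

```c
#include <assert.h>

/* Hypothetical snapshot of a cumulative "pages dirtied" counter. */
struct dirtied_snapshot {
	unsigned long pages;	/* counter value at snapshot time */
	unsigned long ticks;	/* timestamp in clock ticks */
};

/*
 * Pages dirtied per second between two snapshots.
 * ticks_per_sec plays the role of HZ; all names here are illustrative.
 * The multiplication is not guarded against overflow in this sketch.
 */
static unsigned long dirty_rate(struct dirtied_snapshot prev,
				struct dirtied_snapshot now,
				unsigned long ticks_per_sec)
{
	unsigned long elapsed = now.ticks - prev.ticks;

	assert(ticks_per_sec > 0);
	if (elapsed == 0)
		return 0;	/* no time passed: no estimate */

	/* unsigned subtraction stays correct if the counter wraps once */
	return (now.pages - prev.pages) * ticks_per_sec / elapsed;
}
```

For example, 500 pages dirtied over 200 ticks at 1000 ticks per second
gives an estimate of 2500 pages per second.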

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:58:40.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:58:40.000000000 +0800
@@ -1530,6 +1530,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:55.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,



* [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13460 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within one second's worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt well to both slow and fast storage devices.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

Targeting a 12.5% fluctuation range in pos_ratio while dirty pages
fluctuate within [setpoint - write_bw/2, setpoint + write_bw/2], a swing
of write_bw in dirty pages must change pos_ratio by 1/8, which gives the
slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0 gives the parameter used in the code:

	x_intercept = setpoint + 8 * write_bw
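
The derivation above can be checked numerically with a small userspace
sketch. It uses floating point for readability only (the patch itself
works in fixed point), and the function names are made up for this
example.

```c
#include <assert.h>

/*
 * Linear bdi control line from the changelog:
 *     pos_ratio = 1.0 + k * (dirty - setpoint),  k = -1 / (8 * write_bw)
 * All quantities are in pages (write_bw in pages per second).
 */
static double bdi_line_pos_ratio(double dirty, double setpoint,
				 double write_bw)
{
	assert(write_bw > 0.0);
	return 1.0 - (dirty - setpoint) / (8.0 * write_bw);
}

/* The point where the line crosses pos_ratio = 0. */
static double bdi_line_x_intercept(double setpoint, double write_bw)
{
	return setpoint + 8.0 * write_bw;
}
```

With setpoint = 10000 and write_bw = 2560, the line yields 1.0625 at
setpoint - write_bw/2 and 0.9375 at setpoint + write_bw/2, a spread of
0.125, and reaches 0 at x_intercept = 30480.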

The global and bdi slopes complement each other nicely when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
is related to memory size due to interference between the disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
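
The weighted sum can be illustrated with a userspace rendering of the
span computation in bdi_position_ratio(). Writing w = bdi_thresh / thresh,
the result is roughly (1 - w) * bdi_thresh + w * 4 * write_bw. The helper
name and the plain uint64_t arithmetic are choices made for this sketch;
the patch uses div_u64() and bdi->avg_write_bandwidth.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Span of the main bdi control line, in pages.
 * Single bdi (bdi_thresh ~= thresh): span ~= 4 * write_bw.
 * JBOD (bdi_thresh << thresh):       span ~= bdi_thresh.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	assert(write_bw > 0);
	if (bdi_thresh > thresh)
		bdi_thresh = thresh;	/* same clamp as in the patch */

	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}
```

With thresh = bdi_thresh = 100000 and write_bw = 2500 the span is 9999,
close to 4 * write_bw. With bdi_thresh = 10000 and write_bw = 1000 it is
9399, dominated by the bdi_thresh term.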

peter: use 3rd order polynomial for the global control line
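
A minimal userspace sketch of the third-order global control line,
f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3, in the same
fixed-point style as the patch. The function name is invented for this
example, and it multiplies and divides by 2^10 instead of shifting so
that negative intermediate values stay well defined in portable C;
rounding therefore may differ slightly from the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* mirrors RATELIMIT_CALC_SHIFT in the patch */

/* Returns pos_ratio scaled by 2^CALC_SHIFT (1024 means 1.0). */
static int64_t global_pos_ratio(int64_t dirty, int64_t freerun,
				int64_t limit)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, pos_ratio;

	assert(freerun < limit);
	if (dirty >= limit)
		return 0;	/* at or above the hard limit */

	/* x = (setpoint - dirty) / (limit - setpoint), in fixed point */
	x = (setpoint - dirty) * (1 << CALC_SHIFT) / (limit - setpoint + 1);

	/* cube x, rescaling after each multiplication */
	pos_ratio = x;
	pos_ratio = pos_ratio * x / (1 << CALC_SHIFT);
	pos_ratio = pos_ratio * x / (1 << CALC_SHIFT);

	return pos_ratio + (1 << CALC_SHIFT);
}
```

With freerun = 1000 and limit = 3000 (setpoint = 2000), the result is
exactly 1024 at the setpoint, close to 2048 at freerun, and close to 0
just below the limit.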

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages to be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that is subject to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that is subject to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For the single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within the range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) yields a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in the single bdi case, as indicated by
+	 * (thresh - bdi_thresh ~= 0), and transition to bdi_thresh for JBOD.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,


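As a rough numerical check of the main bdi control line described in the
bdi_position_ratio() comments, the user-space sketch below evaluates

	pos_ratio = 1.0 + k * (bdi_dirty - setpoint),  k = -1 / (8 * write_bw)

in the same 10-bit fixed point. The helper name and the sample numbers are
made up for illustration. It covers the single bdi case only, ignores the
auxiliary line and the JBOD span weighting, and divides by
(x_intercept - setpoint) without the "+ 1" guard the patch uses.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* 1.0 == 1 << 10, as in the patch */

/*
 * Hypothetical helper, single bdi case only:
 *   x_intercept = setpoint + 8 * write_bw
 *   pos_ratio   = (x_intercept - bdi_dirty) / (x_intercept - setpoint)
 * setpoint and bdi_dirty are in pages, write_bw in pages per second.
 */
static unsigned long bdi_line_pos_ratio(unsigned long setpoint,
					unsigned long write_bw,
					unsigned long bdi_dirty)
{
	unsigned long x_intercept = setpoint + 8 * write_bw;

	assert(write_bw > 0);
	if (bdi_dirty >= x_intercept)
		return 0;	/* at or beyond the intercept: fully throttled */

	return (unsigned long)(((uint64_t)(x_intercept - bdi_dirty)
				<< RATELIMIT_CALC_SHIFT) /
			       (x_intercept - setpoint));
}
```

With setpoint = 10000 and write_bw = 1000, letting bdi_dirty swing over
[9500, 10500] moves the result between 1088 and 960, that is 128/1024 =
12.5% of the balanced value 1024.
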


* [PATCH 3/5] writeback: dirty rate control
  2011-08-16  2:20 ` Wu Fengguang
  (?)
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 12818 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }
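
The pause computation above can be sketched in user space as follows. This
is only an illustration under stated assumptions: task_pause_jiffies() and
SKETCH_HZ are hypothetical names, pos_ratio is taken as a 10-bit fixed point
value (1.0 == 1 << RATELIMIT_CALC_SHIFT, as in patch 2/5), and the guard
against a zero task_ratelimit is a choice of this sketch, not something
taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* fixed point shift from patch 2/5 */
#define SKETCH_HZ		1000	/* assumed tick rate, illustration only */

/*
 * Hypothetical helper: jiffies to sleep after dirtying pages_dirtied pages.
 * dirty_ratelimit is in pages per second.
 */
static unsigned long task_pause_jiffies(unsigned long dirty_ratelimit,
					unsigned long pos_ratio,
					unsigned long pages_dirtied)
{
	uint64_t task_ratelimit;

	/* patch 3/5 keeps bdi->dirty_ratelimit at 1 or above */
	assert(dirty_ratelimit > 0);

	task_ratelimit = ((uint64_t)dirty_ratelimit * pos_ratio)
						>> RATELIMIT_CALC_SHIFT;
	if (task_ratelimit == 0)
		task_ratelimit = 1;	/* sketch-only guard against div by 0 */

	return (unsigned long)((uint64_t)pages_dirtied * SKETCH_HZ /
			       task_ratelimit);
}
```

For example, with dirty_ratelimit = 1000 pages/s and 100 pages dirtied, a
pos_ratio of 1.0 (1024) gives a 100 jiffy pause, 0.5 (512) doubles it to
200 and 2.0 (2048) halves it to 50.
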

Every 200ms, update bdi->dirty_ratelimit
========================================

    bdi_update_dirty_ratelimit()
    {
        pos_rate = ratelimit_in_past_200ms
                 = bdi->dirty_ratelimit * bdi_position_ratio();

        balanced_rate = ratelimit_in_past_200ms * write_bw / dirty_rate;

        update bdi->dirty_ratelimit closer to balanced_rate and pos_rate
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
can still estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measure

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For aggressive dd dirtiers, the following equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because substituting (5) into (6) gives the desired equality (2):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit
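
The estimate (6) can be checked with plain integer arithmetic. The sketch
below is illustrative only: both helper names are hypothetical, all rates
are in pages per second, and it assumes every task is an aggressive dirtier
that runs exactly at its rate limit, as in (4).

```c
#include <assert.h>
#include <stdint.h>

/* Equation (4): N aggressive dirtiers, each running at its rate limit. */
static unsigned long aggregate_dirty_rate(unsigned long nr_tasks,
					  unsigned long task_ratelimit)
{
	return nr_tasks * task_ratelimit;
}

/*
 * Equation (6):
 *   task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)
 * The multiplication is done first, in 64 bits, to keep precision.
 */
static unsigned long balanced_task_ratelimit(unsigned long task_ratelimit_0,
					     unsigned long write_bw,
					     unsigned long dirty_rate)
{
	assert(dirty_rate > 0);
	return (unsigned long)((uint64_t)task_ratelimit_0 * write_bw /
			       dirty_rate);
}
```

With N = 4 and write_bw = 25600 pages/s, starting from task_ratelimit_0 =
1000 or from task_ratelimit_0 = 50 both give 6400 = write_bw / N after one
measurement, which is why any non-zero initial value is OK in (3).
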

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

The dirty position control works by splitting (6) into

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extending (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, and thus DROP to the setpoint
(and the reverse).
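
One function meeting both conditions is the 3rd order polynomial that patch
2/5 uses for the global control line, f(dirty) = 1 + ((setpoint - dirty) /
(limit - setpoint))^3. The user-space sketch below evaluates it in 10-bit
fixed point. global_pos_ratio() is a hypothetical name, and unlike the
kernel code it uses signed 64-bit multiplication and division instead of
shifts, so negative intermediate values round toward zero.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10
#define ONE			(1 << RATELIMIT_CALC_SHIFT)	/* 1.0 */

/* Hypothetical helper: cubic global control line, fixed point result. */
static int64_t global_pos_ratio(unsigned long setpoint, unsigned long limit,
				unsigned long dirty)
{
	int64_t x, pos_ratio;

	assert(setpoint < limit);
	if (dirty >= limit)
		return 0;	/* hard limit reached: stop dirtying */

	/* x = (setpoint - dirty) / (limit - setpoint), scaled by ONE */
	x = ((int64_t)setpoint - (int64_t)dirty) * ONE /
	    (int64_t)(limit - setpoint + 1);

	pos_ratio = x;
	pos_ratio = pos_ratio * x / ONE;	/* x^2 */
	pos_ratio = pos_ratio * x / ONE;	/* x^3 */
	return pos_ratio + ONE;			/* 1 + x^3 */
}
```

With setpoint = 1000 and limit = 2000 the result is exactly 1024 (1.0) at
dirty = 1000, above 1024 for fewer dirty pages and below 1024 for more,
which is the negative feedback the text asks for.
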

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged over
the past 200ms (very small compared to the 3s estimation period for
write_bw), which makes the distribution of balanced_rate rather
dispersed.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique would
introduce very undesirable time lags, I gave up on it entirely. (btw,
the 3s write_bw averaging time lag is much more acceptable because its
impact is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_rate. Truncates, due to their possibly bumpy
nature, can hardly be compensated for smoothly. So let's face it. When
some over-estimated balanced_rate brings dirty_ratelimit high, dirty
pages will go higher than the setpoint. pos_rate will in turn become
lower than dirty_ratelimit.  So if we consider both balanced_rate and
pos_rate and update dirty_ratelimit only when they are on the same side
of dirty_ratelimit, the systematic errors in balanced_rate won't be
able to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, but that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point in bringing up dirty_ratelimit in a hurry only to hurt both of
the above goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty_ratelimit when the change would work against the
   position control target (the adjusted rate would slow down the
   progress of dirty pages going back to the setpoint).

2) limit the step size. pos_rate changes value step by step, leaving a
   consistent trace compared to the randomly jumping balanced_rate.
   pos_rate also has nicely small errors in the stable state and
   typically larger errors when there are big errors in rate, so it's
   a pretty good limiting factor for the step size of dirty_ratelimit.
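
The two constraints can be sketched in user space as below. This mirrors
the structure of bdi_update_dirty_ratelimit() in this patch but is not the
patch itself: next_dirty_ratelimit(), min_ul() and max_ul() are hypothetical
names, and the extra "decrease the track speed near the target" shift is
left out to keep the example short.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Hypothetical helper: one 200ms update step of the base rate limit. */
static unsigned long next_dirty_ratelimit(unsigned long base_rate,
					  unsigned long balanced_rate,
					  unsigned long pos_rate)
{
	unsigned long delta = 0;

	assert(base_rate > 0);

	/*
	 * Constraint 1: only move when balanced_rate and pos_rate are on
	 * the same side of base_rate. Constraint 2: never step further
	 * than the nearer of the two.
	 */
	if (base_rate < balanced_rate) {
		if (base_rate < pos_rate)
			delta = min_ul(balanced_rate, pos_rate) - base_rate;
	} else {
		if (base_rate > pos_rate)
			delta = base_rate - max_ul(balanced_rate, pos_rate);
	}

	delta = (delta + 7) / 8;	/* track slowly to avoid overshooting */

	if (base_rate < balanced_rate)
		base_rate += delta;
	else
		base_rate -= delta;

	return max_ul(base_rate, 1);
}
```

Starting from base_rate = 1000 with balanced_rate = 2000, a pos_rate of
1200 agrees with the direction and moves the limit to 1025, while a
pos_rate of 900 disagrees and leaves it at 1000.
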

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |  108 +++++++++++++++++++++++++++++++++-
 3 files changed, 114 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-16 10:07:23.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-16 10:07:23.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 10:13:33.000000000 +0800
@@ -773,6 +773,104 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in the long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long base_rate = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long executed_rate;
+	unsigned long balanced_rate;
+	unsigned long pos_rate;
+	unsigned long delta;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in the long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * executed_rate reflects each dd's dirty rate for the past 200ms.
+	 */
+	executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle bandwidth.
+	 */
+	balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
+				dirty_rate | 1);
+
+	/*
+	 * Use a different name for the same value to distinguish the concepts.
+	 * Only the relative value of
+	 *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
+	 * will be used below, which reflects the direction and size of dirty
+	 * position error.
+	 */
+	pos_rate = executed_rate;
+
+	/*
+	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
+	 * same side of dirty_ratelimit, too.
+	 * For example,
+	 * - (base_rate > balanced_rate) => dirty rate is too high
+	 * - (base_rate > pos_rate)      => dirty pages are above setpoint
+	 * so lowering base_rate will help meet both the position and rate
+	 * control targets. Otherwise, don't update base_rate if it will only
+	 * help meet the rate target. After all, what users ultimately feel and
+	 * care about are a stable dirty rate and a small position error.  This
+	 * update policy can also prevent dirty_ratelimit from being driven
+	 * away by possible systematic errors in balanced_rate.
+	 *
+	 * |base_rate - pos_rate| is also used to limit the step size for
+	 * filtering out the singular points of balanced_rate, which keeps
+	 * jumping around randomly and can even leap far away at times due to
+	 * the small 200ms estimation period of dirty_rate (we want to keep
+	 * that period small to reduce time lags).
+	 */
+	delta = 0;
+	if (base_rate < balanced_rate) {
+		if (base_rate < pos_rate)
+			delta = min(balanced_rate, pos_rate) - base_rate;
+	} else {
+		if (base_rate > pos_rate)
+			delta = base_rate - max(balanced_rate, pos_rate);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	delta = (delta + 7) / 8;
+
+	if (base_rate < balanced_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
+	bdi->dirty_ratelimit = max(base_rate, 1UL);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -783,6 +881,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -791,6 +890,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -800,12 +900,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 13121 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
    	pos_rate = ratelimit_in_past_200ms
		 = bdi->dirty_ratelimit * bdi_position_ratio();

	balanced_rate = ratelimit_in_past_200ms * write_bw / dirty_rate;

        update bdi->dirty_ratelimit closer to balanced_rate and pos_rate
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

In other words, we want to throttle tasks such that the dirty rate matches
the writeout bandwidth, which yields a stable number of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
can still estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
 		  	 (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the following equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because substituting (5) into (6) gives the desired equality (2):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit
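
As a rough userspace illustration of (6) and the pause formula above, the
arithmetic can be sketched as follows. The function names and the use of
milliseconds are assumptions made for this example only; they are not part
of the patch, which does the equivalent computation inside
bdi_update_dirty_ratelimit() and balance_dirty_pages().

```c
#include <stdint.h>

/*
 * Estimate the balanced per-task rate limit from one measurement period,
 * following equation (6): task_ratelimit_0 * write_bw / dirty_rate.
 * All rates are in pages per second. A zero dirty_rate is bumped to 1 so
 * the division is always defined (an assumption of this sketch).
 */
static unsigned long balanced_task_ratelimit(unsigned long task_ratelimit_0,
					     unsigned long write_bw,
					     unsigned long dirty_rate)
{
	if (dirty_rate == 0)
		dirty_rate = 1;
	return (unsigned long)((uint64_t)task_ratelimit_0 * write_bw /
			       dirty_rate);
}

/* Pause, in milliseconds, for a task that has dirtied nr_dirtied pages. */
static unsigned long task_pause_ms(unsigned long nr_dirtied,
				   unsigned long task_ratelimit)
{
	if (task_ratelimit == 0)
		task_ratelimit = 1;
	return (unsigned long)((uint64_t)nr_dirtied * 1000 / task_ratelimit);
}
```

With N = 4 dd tasks each throttled at task_ratelimit_0 = 10 pages/s, the
measured dirty_rate is 40 pages/s. For write_bw = 100 pages/s the estimate
is 10 * 100 / 40 = 25 pages/s, which equals write_bw / N, and a task that
has dirtied 5 pages would pause for 200ms.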

task_ratelimit with position control
------------------------------------

However, while the above gives us a means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

The dirty position control works by splitting (6) into

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extending (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that satisfies

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_rate, so that the dirty pages are
created more slowly than they are cleaned, and thus DROP to the setpoint
(and the reverse).
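
To make the two conditions concrete, here is a minimal sketch of one
function that satisfies them. It is a hypothetical linear stand-in chosen
for readability, not the control line used by this series (the cover letter
says the global control line is a 3rd order polynomial), and the name
linear_pos_ratio and the floating-point arithmetic are assumptions of this
example.

```c
/*
 * Hypothetical linear feedback function: exactly 1.0 at the setpoint,
 * falling to 0.0 at the limit and rising above 1.0 below the setpoint,
 * so it is strictly decreasing between 0 and limit (df/dx < 0).
 */
static double linear_pos_ratio(double dirty, double setpoint, double limit)
{
	double ratio;

	if (limit <= setpoint)
		return 0.0;	/* degenerate input: throttle fully */

	ratio = (limit - dirty) / (limit - setpoint);
	return ratio < 0.0 ? 0.0 : ratio;
}
```

With setpoint = 500 and limit = 1000, 750 dirty pages give a ratio of 0.5
(tasks are throttled to half of balanced_rate) and 250 dirty pages give 1.5.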

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
(*) There are some imperfections in balanced_rate, which make it
unsuitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is averaged over only
the past 200ms (very short compared to the 3s estimation period for
write_bw), which results in a rather dispersed distribution of
balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique would
introduce very undesirable time lags, I gave up on it entirely. (btw, the
3s write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_rate. Truncates, due to their possibly bumpy
nature, can hardly be compensated for smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematic errors in balanced_rate won't be able
to bring dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate when near the max
pause and free run areas, but that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for issue 2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point in bringing up dirty_ratelimit in a hurry only to hurt both of
the above goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. pos_rate changes value step by step,
   leaving a consistent trace compared to the randomly jumping
   balanced_rate. pos_rate also has nicely small errors in the stable
   state and typically larger errors when there are big errors in rate.
   So it's a pretty good limiting factor for the step size of dirty_ratelimit.
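
The two constraints can be tried out in userspace with a small sketch like
the one below. It follows the arithmetic of the update step in this patch,
but the function name update_dirty_ratelimit, the min_ul/max_ul helpers and
the explicit guard on the shift count are assumptions of this example (the
guard is there because shifting by the full width of the type or more is
undefined behaviour in standard C).

```c
#include <limits.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Returns the new base rate given the two targets for this period. */
static unsigned long update_dirty_ratelimit(unsigned long base_rate,
					    unsigned long balanced_rate,
					    unsigned long pos_rate)
{
	unsigned long delta = 0;
	unsigned long shift;

	/* Constraint 1: move only if both targets are on the same side. */
	if (base_rate < balanced_rate) {
		if (base_rate < pos_rate)
			delta = min_ul(balanced_rate, pos_rate) - base_rate;
	} else {
		if (base_rate > pos_rate)
			delta = base_rate - max_ul(balanced_rate, pos_rate);
	}

	/* Damp small errors; guard the shift count (sketch-only addition). */
	shift = base_rate / (8 * delta + 1);
	if (shift >= sizeof(delta) * CHAR_BIT)
		delta = 0;
	else
		delta >>= shift;

	/* Constraint 2: limit the step to about 1/8 of the remaining gap. */
	delta = (delta + 7) / 8;

	if (base_rate < balanced_rate)
		base_rate += delta;
	else
		base_rate -= delta;

	return max_ul(base_rate, 1);
}
```

For base_rate = 100: with balanced_rate = 200 and pos_rate = 80 the targets
disagree and the rate stays at 100; with pos_rate = 150 it moves to 107;
with balanced_rate = 50 and pos_rate = 60 it drops to 95. A small error
(base_rate = 1000, balanced_rate = 1010, pos_rate = 1005) is damped away
completely and the rate stays at 1000.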

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |  108 +++++++++++++++++++++++++++++++++-
 3 files changed, 114 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-16 10:07:23.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-16 10:07:23.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-16 10:07:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 10:13:33.000000000 +0800
@@ -773,6 +773,104 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long base_rate = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long executed_rate;
+	unsigned long balanced_rate;
+	unsigned long pos_rate;
+	unsigned long delta;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * executed_rate reflects each dd's dirty rate for the past 200ms.
+	 */
+	executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle bandwidth.
+	 */
+	balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
+				dirty_rate | 1);
+
+	/*
+	 * Use a different name for the same value to distinguish the concepts.
+	 * Only the relative value of
+	 *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
+	 * will be used below, which reflects the direction and size of dirty
+	 * position error.
+	 */
+	pos_rate = executed_rate;
+
+	/*
+	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
+	 * same side of dirty_ratelimit, too.
+	 * For example,
+	 * - (base_rate > balanced_rate) => dirty rate is too high
+	 * - (base_rate > pos_rate)      => dirty pages are above setpoint
+	 * so lowering base_rate will help meet both the position and rate
+	 * control targets. Otherwise, don't update base_rate if it will only
+	 * help meet the rate target. After all, what the users ultimately feel
+	 * and care are stable dirty rate and small position error.  This
+	 * update policy can also prevent dirty_ratelimit from being driven
+	 * away by possible systematic errors in balanced_rate.
+	 *
+	 * |base_rate - pos_rate| is also used to limit the step size for
+	 * filtering out the singular points of balanced_rate, which keeps
+	 * jumping around randomly and can even leap far away at times due to
+	 * the small 200ms estimation period of dirty_rate (we want to keep
+	 * that period small to reduce time lags).
+	 */
+	delta = 0;
+	if (base_rate < balanced_rate) {
+		if (base_rate < pos_rate)
+			delta = min(balanced_rate, pos_rate) - base_rate;
+	} else {
+		if (base_rate > pos_rate)
+			delta = base_rate - max(balanced_rate, pos_rate);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	delta = (delta + 7) / 8;
+
+	if (base_rate < balanced_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
+	bdi->dirty_ratelimit = max(base_rate, 1);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -783,6 +881,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -791,6 +890,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -800,12 +900,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  2:20 ` Wu Fengguang
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 7450 bytes --]

Add two fields to task_struct.

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.
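
The near-sqrt scaling can be checked in userspace with a sketch along these
lines. dirty_poll_interval() mirrors the helper added by this patch;
ilog2_ul() is a hypothetical stand-in for the kernel's ilog2(), written out
here only so that the example is self-contained.

```c
/* Floor of log2(n) for n > 0; stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	while ((n >>= 1) != 0)
		log++;
	return log;
}

/*
 * Number of pages a task may dirty before it polls the dirty state
 * again: roughly the square root of the remaining safety gap.
 */
static unsigned long dirty_poll_interval(unsigned long dirty,
					 unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;
}
```

A gap of 1024 pages gives an interval of 32 pages, a gap of 1000 pages
gives 16 (halving the floored log2 rounds down), and once the threshold is
reached the interval falls back to 1 page.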

The main problem with per-task nr_dirtied is that if 1k+ tasks start
dirtying pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so the dirty threshold will be exceeded
long before each task reaches its nr_dirtied_pause and hence calls
balance_dirty_pages().

The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++
 kernel/fork.c         |    3 +
 mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
 3 files changed, 61 insertions(+), 39 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
@@ -1525,6 +1525,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
@@ -54,20 +54,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -169,6 +155,8 @@ static void update_completion_period(voi
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
+
+	writeback_set_ratelimit();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
@@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If dirty_poll_interval is too low, big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long dirty_poll_interval(unsigned long dirty,
+					 unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
+static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
 					unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	unsigned long ratelimit;
-	unsigned long *p;
+	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	/* start from the per-task interval; tighten it once over the limit */
+	ratelimit = current->nr_dirtied_pause;
+	if (bdi->dirty_exceeded)
+		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
+
+	current->nr_dirtied += nr_pages_dirtied;
 
+	preempt_disable();
 	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
+	 * This prevents one CPU from accumulating too many dirtied pages
+	 * without calling into balance_dirty_pages(), which can happen when
+	 * 1000+ tasks all start dirtying pages at exactly the same time and
+	 * hence all honour a too large initial task->nr_dirtied_pause.
 	 */
-	preempt_disable();
 	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	else {
+		*p += nr_pages_dirtied;
+		if (unlikely(*p >= ratelimit_pages)) {
+			*p = 0;
+			ratelimit = 0;
+		}
 	}
 	preempt_enable();
+
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
  *
  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
+ * thresholds.
  */
 
 void writeback_set_ratelimit(void)
 {
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
 
 static int __cpuinit
--- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
@@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.
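
The interval arithmetic in the patch above can be tried out in user space.
The following is only a sketch, not kernel code: ilog2_ul() is a
hypothetical stand-in for the kernel's ilog2(), and poll_interval() and
cpu_ratelimit() are illustrative names that mirror the arithmetic of
dirty_poll_interval() and writeback_set_ratelimit(). All inputs are page
counts chosen by the caller.

```c
#include <assert.h>

/* floor(log2(n)) for n > 0; hypothetical stand-in for the kernel's ilog2() */
static int ilog2_ul(unsigned long n)
{
	int log = -1;

	assert(n > 0);
	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/*
 * Per-task poll interval: roughly the square root of the safety margin
 * (thresh - dirty), rounded down to a power of two. Returns 1 once the
 * threshold has been reached.
 */
static unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;
}

/*
 * Per-CPU safeguard: 1/32 of the dirty threshold, shared between the
 * online CPUs, with a floor of 16 pages.
 */
static unsigned long cpu_ratelimit(unsigned long dirty_thresh,
				   unsigned int nr_cpus)
{
	unsigned long pages;

	assert(nr_cpus > 0);
	pages = dirty_thresh / (nr_cpus * 32UL);

	return pages < 16 ? 16 : pages;
}
```

With a margin of 100000 pages the task polls every 256 pages, while a task
that is already over the threshold polls on every dirtied page. The
power-of-two rounding is why the scaling is only "near" sqrt.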




* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20 ` Wu Fengguang
  (?)
@ 2011-08-16  2:20   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15084 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the meantime, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every throttled writer thread starts foreground writeback, there
  are N IO submitters working on at least N different inodes at the same
  time. They issue N different sets of IO with potentially zero locality
  to each other, which results in much lower elevator sort/merge
  efficiency, and hence the disk seeks all over the place to service the
  different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time is reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd's, it reduces the CPU %system time from 30%
    to 3%, and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user perceivable stalls of more than 1
  second. Even the current 4MB write size may be too large for slow USB
  sticks. The fact that balance_dirty_pages() starts IO by itself couples
  the IO size to the wait time, which makes it hard to pick a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us have had this experience: it often takes a couple of seconds
  or even longer to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and it deteriorates further with an increasing number of dirtiers
  and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very unevenly on the legacy kernel, and throughput
  is improved by 67% by this patchset. (With the larger write chunk size
  on top, the speedup is 93%.)

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to long throttle wait times and jitter.

- NFS may kill a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's unlikely
  that such bursty IO completions, and the resulting large (and tiny)
  stall times in IO completion based throttling, can be optimized away.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in the 1-dd case, and to increase them
to ~100ms in the 1000-dd case.
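
The pause time computation in the diff below can be sketched in user space
as follows. This is an illustration under stated assumptions, not the
kernel code: SKETCH_HZ (1000 ticks per second) and SKETCH_CALC_SHIFT (a
fixed-point shift of 10, so that 1024 means a ratio of 1.0) are assumed
values, SKETCH_MAX_PAUSE follows the 200ms upper bound listed above, and
pause_ticks() is a hypothetical name. base_rate is taken to be in pages
per second.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ         1000             /* assumed tick rate */
#define SKETCH_CALC_SHIFT 10               /* assumed: 1.0 == 1 << 10 */
#define SKETCH_MAX_PAUSE  (SKETCH_HZ / 5)  /* 200ms upper bound */

/*
 * Ticks to sleep after a task dirtied pages_dirtied pages:
 *   task_ratelimit = base_rate * pos_ratio   (fixed point)
 *   pause          = HZ * pages_dirtied / task_ratelimit
 * clamped to the maximum pause.
 */
static long pause_ticks(unsigned long base_rate, unsigned long pos_ratio,
			unsigned long pages_dirtied)
{
	uint64_t task_ratelimit;
	uint64_t pause;

	/* a zero position ratio means "as slow as possible" */
	if (pos_ratio == 0)
		return SKETCH_MAX_PAUSE;

	task_ratelimit = ((uint64_t)base_rate * pos_ratio) >> SKETCH_CALC_SHIFT;
	/* "| 1" avoids a division by zero for tiny rate limits */
	pause = ((uint64_t)SKETCH_HZ * pages_dirtied) / (task_ratelimit | 1);

	return pause < SKETCH_MAX_PAUSE ? (long)pause : SKETCH_MAX_PAUSE;
}
```

Halving pos_ratio roughly doubles the pause for the same number of dirtied
pages, which is how the position feedback slows a task down without
starting any IO on its behalf.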

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that applications get throttled once the global
(background + dirty)/2=15% threshold is crossed, and that dirty pages
are then balanced around 17.5%. Before this patch, the behavior was to
just throttle at 20% dirtyable memory in the 1-dd case.

Since the task will be soft throttled earlier than before, end users
may perceive it as a performance "slow down" if their application
happens to dirty more than 15% of dirtyable memory.
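
The 15% and 17.5% figures follow from simple midpoint arithmetic. The
sketch below assumes that the balance point is the midpoint between the
throttling start and the dirty threshold, which is what reproduces the
17.5% figure from a 10% background and a 20% dirty threshold. The struct
and function names are hypothetical and all values are page counts.

```c
#include <assert.h>

struct throttle_points {
	unsigned long freerun;	/* throttling starts above this */
	unsigned long setpoint;	/* dirty pages are balanced around this */
};

/*
 * freerun  = (background_thresh + dirty_thresh) / 2
 * setpoint = (freerun + dirty_thresh) / 2   (assumed midpoint)
 */
static struct throttle_points
compute_throttle_points(unsigned long background_thresh,
			unsigned long dirty_thresh)
{
	struct throttle_points tp;

	assert(background_thresh <= dirty_thresh);
	tp.freerun = (background_thresh + dirty_thresh) / 2;
	tp.setpoint = (tp.freerun + dirty_thresh) / 2;

	return tp;
}
```

For 100000 dirtyable pages with thresholds of 10000 and 20000 pages, this
gives a throttling start at 15000 pages (15%) and a balance point at 17500
pages (17.5%).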

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15387 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progress is very bumpy on the legacy kernel, and throughput
  is improved by 67% by this patchset. (With the larger write chunk size
  added, the speedup is 93%.)

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are unlikely to be optimized away, and neither are the
  resulting large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in the 1-dd case, and increase to
~100ms in the 1000-dd case.
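
As a rough illustration of how a pause time follows from the number of
pages dirtied, here is a minimal stand-alone sketch modeled on the
calculation in the diff of this patch. pause_jiffies() is a hypothetical
helper, and the values chosen for HZ, MAX_PAUSE and RATELIMIT_CALC_SHIFT
are assumptions for the example only; the kernel takes them from its own
configuration and from the earlier patches in this series.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define HZ                   1000       /* jiffies per second */
#define MAX_PAUSE            (HZ / 5)   /* 200ms upper bound on one pause */
#define RATELIMIT_CALC_SHIFT 10         /* pos_ratio fixed point: 1.0 == 1 << 10 */

/*
 * base_rate:     dirty rate limit of the bdi, in pages per second
 * pos_ratio:     position feedback, fixed point (RATELIMIT_CALC_SHIFT bits)
 * pages_dirtied: pages the task dirtied since its last pause
 *
 * Returns the time to sleep, in jiffies.
 */
static long pause_jiffies(unsigned long base_rate, unsigned long pos_ratio,
			  unsigned long pages_dirtied)
{
	unsigned long task_ratelimit;
	long pause;

	/* A zero ratio means "stop dirtying": sleep for the maximum pause. */
	if (pos_ratio == 0)
		return MAX_PAUSE;

	/* Scale the base rate by the position ratio. */
	task_ratelimit = (unsigned long)(((uint64_t)base_rate * pos_ratio)
					 >> RATELIMIT_CALC_SHIFT);

	/* "| 1" avoids a division by zero when the rate rounds down to 0. */
	pause = (long)((HZ * pages_dirtied) / (task_ratelimit | 1));

	return pause < MAX_PAUSE ? pause : MAX_PAUSE;
}
```

Under these assumptions, a task that dirtied 256 pages against a rate
limit of 25600 pages/s (about 100MB/s with 4KB pages) and a pos_ratio
of 1.0 sleeps for 9 jiffies, which is in line with the ~10ms target of
the 1-dd case.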

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that applications get throttled once crossing the
global (background + dirty)/2=15% threshold, and are then balanced around
17.5%. Before the patch, the behavior was to just throttle them at 20% of
dirtyable memory in the 1-dd case.

Since the task will be soft throttled earlier than before, end users may
perceive it as a performance "slow down" if their application happens to
dirty more than 15% of dirtyable memory.
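
The 15% and 17.5% figures can be reproduced with a small worked
example. This is only a sketch: compute_dirty_points() is a hypothetical
helper, the 10%/20% inputs are the background and dirty ratios implied
by the numbers above, and taking the balance point as the midpoint
between the throttle start and the dirty limit is an assumption that
happens to match the 17.5% figure.

```c
/* All values are in tenths of a percent of dirtyable memory. */
struct dirty_points {
	unsigned int freerun;	/* throttling starts above this point */
	unsigned int setpoint;	/* dirty pages are balanced around this point */
};

static struct dirty_points compute_dirty_points(unsigned int background,
						unsigned int dirty)
{
	struct dirty_points p;

	/* Throttling starts halfway between the two thresholds. */
	p.freerun = (background + dirty) / 2;

	/* Assumption: balance halfway between throttle start and the limit. */
	p.setpoint = (p.freerun + dirty) / 2;

	return p;
}
```

With background = 100 (10%) and dirty = 200 (20%), this yields
freerun = 150 (15%) and setpoint = 175 (17.5%).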

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16  7:17     ` Andrea Righi
  -1 siblings, 0 replies; 203+ messages in thread
From: Andrea Righi @ 2011-08-16  7:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:10AM +0800, Wu Fengguang wrote:
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> scale near-sqrt to the safety gap between dirty pages and threshold.
> 
> The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
> pages at exactly the same time, each task will be assigned a large
> initial nr_dirtied_pause, so that the dirty threshold will be exceeded
> long before each task reached its nr_dirtied_pause and hence call
> balance_dirty_pages().
> 
> The solution is to watch for the number of pages dirtied on each CPU in
> between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
> (3% dirty threshold), force call balance_dirty_pages() for a chance to
> set bdi->dirty_exceeded. In normal situations, this safeguarding
> condition is not expected to trigger at all.
> 
> peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/sched.h |    7 +++
>  kernel/fork.c         |    3 +
>  mm/page-writeback.c   |   90 ++++++++++++++++++++++------------------
>  3 files changed, 61 insertions(+), 39 deletions(-)
> 
> --- linux-next.orig/include/linux/sched.h	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/include/linux/sched.h	2011-08-15 10:26:05.000000000 +0800
> @@ -1525,6 +1525,13 @@ struct task_struct {
>  	int make_it_fail;
>  #endif
>  	struct prop_local_single dirties;
> +	/*
> +	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
> +	 * balance_dirty_pages() for some dirty throttling pause
> +	 */
> +	int nr_dirtied;
> +	int nr_dirtied_pause;
> +
>  #ifdef CONFIG_LATENCYTOP
>  	int latency_record_count;
>  	struct latency_record latency_record[LT_SAVECOUNT];
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 10:26:04.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 13:51:16.000000000 +0800
> @@ -54,20 +54,6 @@
>   */
>  static long ratelimit_pages = 32;
>  
> -/*
> - * When balance_dirty_pages decides that the caller needs to perform some
> - * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than dirtied pages to ensure that reasonably
> - * large amounts of I/O are submitted.
> - */
> -static inline long sync_writeback_pages(unsigned long dirtied)
> -{
> -	if (dirtied < ratelimit_pages)
> -		dirtied = ratelimit_pages;
> -
> -	return dirtied + dirtied / 2;
> -}
> -
>  /* The following parameters are exported via /proc/sys/vm */
>  
>  /*
> @@ -169,6 +155,8 @@ static void update_completion_period(voi
>  	int shift = calc_period_shift();
>  	prop_change_shift(&vm_completions, shift);
>  	prop_change_shift(&vm_dirties, shift);
> +
> +	writeback_set_ratelimit();
>  }
>  
>  int dirty_background_ratio_handler(struct ctl_table *table, int write,
> @@ -930,6 +918,23 @@ static void bdi_update_bandwidth(struct 
>  }
>  
>  /*
> + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> + * will look to see if it needs to start dirty throttling.
> + *
> + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> + * global_page_state() too often. So scale it near-sqrt to the safety margin
> + * (the number of pages we may dirty without exceeding the dirty limits).
> + */
> +static unsigned long dirty_poll_interval(unsigned long dirty,
> +					 unsigned long thresh)
> +{
> +	if (thresh > dirty)
> +		return 1UL << (ilog2(thresh - dirty) >> 1);
> +
> +	return 1;
> +}
> +
> +/*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> @@ -1072,6 +1077,9 @@ static void balance_dirty_pages(struct a
>  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> +	current->nr_dirtied = 0;
> +	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
> +
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -1098,7 +1106,7 @@ void set_page_dirty_balance(struct page 
>  	}
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> +static DEFINE_PER_CPU(int, bdp_ratelimits);
>  
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> @@ -1118,31 +1126,40 @@ void balance_dirty_pages_ratelimited_nr(
>  					unsigned long nr_pages_dirtied)
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> -	unsigned long ratelimit;
> -	unsigned long *p;
> +	int ratelimit;
> +	int *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>  
> -	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> -		ratelimit = 8;
> +	if (!bdi->dirty_exceeded)
> +		ratelimit = current->nr_dirtied_pause;
> +	else
> +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Usage of ratelimit before init?

Maybe:

	ratelimit = current->nr_dirtied_pause;
	if (bdi->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Thanks,
-Andrea

> +
> +	current->nr_dirtied += nr_pages_dirtied;
>  
> +	preempt_disable();
>  	/*
> -	 * Check the rate limiting. Also, we do not want to throttle real-time
> -	 * tasks in balance_dirty_pages(). Period.
> +	 * This prevents one CPU to accumulate too many dirtied pages without
> +	 * calling into balance_dirty_pages(), which can happen when there are
> +	 * 1000+ tasks, all of them start dirtying pages at exactly the same
> +	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>  	 */
> -	preempt_disable();
>  	p =  &__get_cpu_var(bdp_ratelimits);
> -	*p += nr_pages_dirtied;
> -	if (unlikely(*p >= ratelimit)) {
> -		ratelimit = sync_writeback_pages(*p);
> +	if (unlikely(current->nr_dirtied >= ratelimit))
>  		*p = 0;
> -		preempt_enable();
> -		balance_dirty_pages(mapping, ratelimit);
> -		return;
> +	else {
> +		*p += nr_pages_dirtied;
> +		if (unlikely(*p >= ratelimit_pages)) {
> +			*p = 0;
> +			ratelimit = 0;
> +		}
>  	}
>  	preempt_enable();
> +
> +	if (unlikely(current->nr_dirtied >= ratelimit))
> +		balance_dirty_pages(mapping, current->nr_dirtied);
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>  
> @@ -1237,22 +1254,17 @@ void laptop_sync_completion(void)
>   *
>   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> - * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> + * thresholds.
>   */
>  
>  void writeback_set_ratelimit(void)
>  {
> -	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
> +	unsigned long background_thresh;
> +	unsigned long dirty_thresh;
> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> +	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
>  	if (ratelimit_pages < 16)
>  		ratelimit_pages = 16;
> -	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
>  }
>  
>  static int __cpuinit
> --- linux-next.orig/kernel/fork.c	2011-08-14 18:03:44.000000000 +0800
> +++ linux-next/kernel/fork.c	2011-08-15 10:26:05.000000000 +0800
> @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
>  	p->pdeath_signal = 0;
>  	p->exit_state = 0;
>  
> +	p->nr_dirtied = 0;
> +	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
> +
>  	/*
>  	 * Ok, make it visible to the rest of the system.
>  	 * We dont wake it up yet.
> 
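
The near-sqrt scaling of dirty_poll_interval() in the hunk above can be
tried out in user space. The following is only a sketch under stated
assumptions: ilog2_ul() and poll_interval() are hypothetical names, and
ilog2_ul() is a simple stand-in for the kernel's ilog2(), not its real
implementation.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for ilog2(): floor(log2(n)), n > 0. */
static unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/*
 * Same arithmetic as dirty_poll_interval() in the quoted patch: halving
 * the log2 of the safety margin gives roughly sqrt(thresh - dirty),
 * rounded down to a power of two.
 */
static unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;	/* at or over the threshold: poll on every page */
}
```

With a margin of 2^20 pages the interval is 2^10 pages; once dirty reaches
thresh it collapses to 1, so the task polls on every dirtied page.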

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-16  7:17     ` Andrea Righi
@ 2011-08-16  7:22       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  7:22 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

> > +	if (!bdi->dirty_exceeded)
> > +		ratelimit = current->nr_dirtied_pause;
> > +	else
> > +		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> 
> Usage of ratelimit before init?
> 
> Maybe:
> 
> 	ratelimit = current->nr_dirtied_pause;
> 	if (bdi->dirty_exceeded)
> 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

Good catch, thanks! That's indeed the original form. I changed it to
make the code more aligned...

Thanks,
Fengguang
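
The corrected ordering agreed on above can be sketched as a stand-alone
function. This is an illustration, not the patch itself: pick_ratelimit(),
min_int() and SKETCH_PAGE_SHIFT are hypothetical names, and 4 KiB pages
are assumed.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Start from the task's own poll interval, then clamp it once the bdi
 * has exceeded its dirty limit. ratelimit is assigned before min() reads
 * it, which is the point of Andrea's suggestion.
 */
static int pick_ratelimit(int nr_dirtied_pause, int dirty_exceeded)
{
	int ratelimit = nr_dirtied_pause;

	assert(nr_dirtied_pause >= 0);
	if (dirty_exceeded)
		ratelimit = min_int(ratelimit, 32 >> (SKETCH_PAGE_SHIFT - 10));

	return ratelimit;
}
```

With 4 KiB pages the clamp is 32 >> 2 = 8 pages, i.e. 32 KiB of dirtied
data between polls while the bdi is over its limit.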

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16 19:41     ` Jan Kara
  -1 siblings, 0 replies; 203+ messages in thread
From: Jan Kara @ 2011-08-16 19:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hello Fengguang,

  this patch is much easier to read than in older versions! Good work!

> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
  I think you can slightly simplify this to:
(thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;


> +	x_intercept = setpoint + 2 * span;
  What if x_intercept > bdi_thresh? Since 8 * bdi->avg_write_bandwidth is
easily 500 MB, I imagine that happens quite often?

> +
> +	if (unlikely(bdi_dirty > setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
  Shouldn't this be bdi_thresh instead of limit? I understand this is a
hard limit but with more bdis this condition is rather weak and almost
never true.

> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			setpoint += span;
> +			pos_ratio >>= 1;
> +		}
  And here you stretch the control area up to the global dirty limit. I
understand you may not want to be really strict and cut the control area at
bdi_thresh, but your choice looks too benevolent: when you have several
active bdis with different speeds, this will effectively erase the
difference between them, won't it? E.g. with 10 bdis, (x_intercept -
bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
bdi_dirty really heavily exceeds bdi_thresh. So wouldn't it be better to
just make sure the control area is reasonably large (e.g. at least 16 MB) to
allow a BDI to ramp up its bdi_thresh, but not extend it up to the global
dirty limit?

> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
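
For readers following the review, the global control line quoted at the
top of this message can be evaluated in user space. This is a sketch
under assumptions: CALC_SHIFT stands in for RATELIMIT_CALC_SHIFT and is
assumed to be 10, global_pos_ratio() is a hypothetical name, and plain
division replaces the shifts, so rounding of negative values may differ
slightly from the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumption: stands in for RATELIMIT_CALC_SHIFT */

/*
 * f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * in fixed point. Returns f scaled by 2^CALC_SHIFT, or 0 once dirty
 * has reached limit.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	const int64_t one = 1 << CALC_SHIFT;
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, pos_ratio;

	assert(freerun < limit);
	if (dirty >= limit)
		return 0;

	/* x = (setpoint - dirty) / (limit - setpoint), scaled by "one" */
	x = (setpoint - dirty) * one / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;	/* x^2 */
	pos_ratio = pos_ratio * x / one;	/* x^3 */
	pos_ratio += one;			/* 1 + x^3 */

	return pos_ratio;
}
```

With freerun = 1000 and limit = 3000 the setpoint is 2000. The result is
exactly 1024 (1.0) at the setpoint, close to 2048 (2.0) at freerun, and
close to 0 just below the limit, matching conditions (1) to (3) in the
quoted comment.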

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16 19:41     ` Jan Kara
  (?)
@ 2011-08-17 13:23     ` Wu Fengguang
  2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  -1 siblings, 2 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

Hi Jan,

On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
>   this patch is much easier to read than in older versions! Good work!

Thank you :)

> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
>   I think you can slightly simplify this to:
> (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;

Good idea!
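
The two forms are algebraically the same, since
(thresh - bdi_thresh + 4 * write_bw) * bdi_thresh / thresh expands to the
numerator used in the patch. A small user-space sketch can compare them
numerically. The function names below are hypothetical, bdi_thresh <=
thresh is assumed (the patch clamps it), and the results may differ by a
page or two because x is truncated to 16 fractional bits.

```c
#include <assert.h>
#include <stdint.h>

/* span as written in the patch */
static uint64_t span_patch(uint64_t thresh, uint64_t bdi_thresh,
			   uint64_t write_bw)
{
	assert(bdi_thresh <= thresh);
	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}

/* span as suggested, reusing x = bdi_thresh / thresh in 16.16 fixed point */
static uint64_t span_suggested(uint64_t thresh, uint64_t bdi_thresh,
			       uint64_t write_bw)
{
	uint64_t x = (bdi_thresh << 16) / (thresh | 1);

	assert(bdi_thresh <= thresh);
	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}

/* absolute difference between the two forms, in pages */
static uint64_t span_diff(uint64_t thresh, uint64_t bdi_thresh,
			  uint64_t write_bw)
{
	uint64_t a = span_patch(thresh, bdi_thresh, write_bw);
	uint64_t b = span_suggested(thresh, bdi_thresh, write_bw);

	return a > b ? a - b : b - a;
}
```

In the single-bdi case (bdi_thresh == thresh) both forms reduce to about
4 * write_bw, which is the span the patch comment asks for.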

> > +	x_intercept = setpoint + 2 * span;
>   What if x_intercept > bdi_thresh? Since 8 * bdi->avg_write_bandwidth is
> easily 500 MB, I imagine that happens quite often?

That's fine, because I no longer treat "bdi_thresh" as a limiting
factor in the way the global "thresh" is. The reason is that it is
unstable in small-memory JBOD systems, which is the big and unique
problem in JBOD.

> > +
> > +	if (unlikely(bdi_dirty > setpoint + span)) {
> > +		if (unlikely(bdi_dirty > limit))
> > +			return 0;
>   Shouldn't this be bdi_thresh instead of limit? I understand this is a
> hard limit but with more bdis this condition is rather weak and almost
> never true.

Yeah, I mean @limit. @bdi_thresh is made weak in IO-less
balance_dirty_pages() in order to get a reasonably smooth dirty rate in
the face of a fluctuating @bdi_thresh.

The tradeoff is to let bdi dirty pages fluctuate more or less freely,
as long as they don't drop so low as to risk IO queue underflow. The
attached patch tries to prevent the underflow (which is good but not
perfect).

> > +		if (x_intercept < limit) {
> > +			x_intercept = limit;	/* auxiliary control line */
> > +			setpoint += span;
> > +			pos_ratio >>= 1;
> > +		}
>   And here you stretch the control area up to the global dirty limit. I
> understand you may not want to be really strict and cut the control area at
> bdi_thresh, but your choice looks too benevolent: when you have several
> active bdis with different speeds, this will effectively erase the
> difference between them, won't it? E.g. with 10 bdis, (x_intercept -
> bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> bdi_dirty really heavily exceeds bdi_thresh.

Yes, the auxiliary control line could be very flat (small slope).

However, it's not normal for the bdi dirty pages to fall into the
range of the auxiliary control line at all. And once it takes effect,
the pos_ratio is at most 0.5 (its value at the connection point with
the main bdi control line), which is strong enough to pull the dirty
pages off the auxiliary bdi control line and back into the scope of
the main bdi control line.

The auxiliary control line is intended for bringing down the bdi_dirty
of the USB key before 250s (where the "pos bandwidth" line stays low):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png

After that the bdi_dirty will fluctuate around bdi_thresh and won't
grow high and step into the scope of the auxiliary control line.
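
The way the two bdi control lines connect can be checked with a small
user-space sketch. It computes only the bdi factor (the patch multiplies
it into the global pos_ratio), uses floating point instead of the patch's
fixed-point arithmetic, and bdi_line() is a hypothetical name.

```c
#include <assert.h>

/*
 * bdi factor of pos_ratio: the main control line up to setpoint + span,
 * then the auxiliary line stretched to limit at half weight.
 */
static double bdi_line(double setpoint, double span, double limit,
		       double bdi_dirty)
{
	double x_intercept = setpoint + 2 * span;
	double scale = 1.0;

	assert(span > 0);
	if (bdi_dirty > setpoint + span) {
		if (bdi_dirty > limit)
			return 0.0;
		if (x_intercept < limit) {
			x_intercept = limit;	/* auxiliary control line */
			setpoint += span;
			scale = 0.5;		/* mirrors pos_ratio >>= 1 */
		}
	}

	return scale * (x_intercept - bdi_dirty) / (x_intercept - setpoint);
}
```

With setpoint = 1000, span = 100 and limit = 5000, the main line gives 1.0
at the setpoint and 0.5 at setpoint + span. The auxiliary line takes over
just below 0.5 and falls slowly towards 0 at the limit, so the factor
never exceeds 0.5 in that region.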

> So wouldn't it be better to
> just make sure the control area is reasonably large (e.g. at least 16 MB) to
> allow a BDI to ramp up its bdi_thresh, but not extend it up to the global
> dirty limit?

In order to treat bdi_thresh as a semi-strict limit, we first need to make
it very stable; otherwise the whole control system may fluctuate
violently.

Thanks,
Fengguang

> > +	}
> > +	pos_ratio *= x_intercept - bdi_dirty;
> > +	do_div(pos_ratio, x_intercept - setpoint + 1);
> > +
> > +	return pos_ratio;
> > +}
> > +
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

[-- Attachment #2: bdi-reserve-area --]
[-- Type: text/plain, Size: 2539 bytes --]

Subject: writeback: dirty position control - bdi reserve area
Date: Thu Aug 04 22:16:46 CST 2011

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small memory systems.

XXX:
When memory is small (in comparison to write bandwidth), this control
line may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended because the bdi is in
danger of IO queue underflow. However the global dirty pages, when
pushed close to limit, will eventually counteract our desire to push up
the low bdi_dirty. In low memory JBOD tests we do see disks
under-utilized from time to time.

One scheme that may completely fix this is to add a BDI_queue_empty to
indicate the block IO queue emptiness (but still there may be in flight
IOs on the driver/hardware side) and to unthrottle the tasks regardless
of the global limit on seeing BDI_queue_empty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-16 09:06:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 09:06:50.000000000 +0800
@@ -488,6 +488,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -571,6 +581,19 @@ static unsigned long bdi_position_ratio(
 	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
 
 	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 */
+	x_intercept = min(bdi->avg_write_bandwidth + 2 * MIN_WRITEBACK_PAGES,
+			  freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
+	/*
 	 * bdi setpoint
 	 *
 	 *        f(dirty) := 1.0 + k * (dirty - setpoint)

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > +		if (x_intercept < limit) {
> > > +			x_intercept = limit;	/* auxiliary control line */
> > > +			setpoint += span;
> > > +			pos_ratio >>= 1;
> > > +		}
> >   And here you stretch the control area upto the global dirty limit. I
> > understand you maybe don't want to be really strict and cut control area at
> > bdi_thresh but your choice looks like too benevolent - when you have
> > several active bdi's with different speeds this will effectively erase
> > difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> > bdi_dirty really heavily exceeds bdi_thresh.
> 
> Yes the auxiliary control line could be very flat (small slope).
> 
> However it's not normal for the bdi dirty pages to fall into the
> range of auxiliary control line at all. And once it takes effect, 
> the pos_ratio is at most 0.5 (which is the value at the connection
> point with the main bdi control line) which is strong enough to pull
> the dirty pages off the auxiliary bdi control line and into the scope
> of main bdi control line.
> 
> The auxiliary control line is intended for bringing down the bdi_dirty
> of the USB key before 250s (where the "pos bandwidth" line keeps low): 
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png
> 
> After that the bdi_dirty will fluctuate around bdi_thresh and won't
> grow high and step into the scope of the auxiliary control line.

Note that the main and auxiliary bdi control lines never take effect at
the same time: the main bdi control line works around and under the
bdi setpoint, and the auxiliary line takes over in the higher range up
to @limit.

In the 1UKEY+1HDD test case, the bdi_dirty of the UKEY rises quickly in
the free run stage, while the global dirty pages are still below
(thresh+bg_thresh)/2.

So it will initially be under the control of the auxiliary line. Hence the
dirtier task will progress at 1/4 to 1/2 of the UKEY's write bandwidth. 
This will bring down the bdi_dirty reasonably fast while still allowing
the dirtier task to make some progress.

The connection point of the main/auxiliary control lines has pos_ratio=0.5.
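
To make the shape of the two lines concrete, here is a floating-point
sketch of the per-bdi factor. It is a model, not the kernel code:
bdi_line_factor() and its parameters are hypothetical names, and the
switch-over point (bdi_setpoint + span) is inferred from the stated
pos_ratio=0.5 at the connection point together with
x_intercept = setpoint + 2 * span.

```c
/*
 * Illustrative model of the main and auxiliary bdi control lines.
 * Returns the factor the bdi position contributes to pos_ratio.
 */
static double bdi_line_factor(double bdi_dirty, double bdi_setpoint,
			      double span, double limit)
{
	double x_intercept = bdi_setpoint + 2 * span;	/* main line hits 0 */
	double knee = bdi_setpoint + span;		/* main line is 0.5 here */

	if (bdi_dirty >= limit)
		return 0.0;

	/* main line: 1.0 at the setpoint, 0.5 at the knee */
	if (bdi_dirty <= knee || x_intercept >= limit)
		return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);

	/* auxiliary line: from (knee, 0.5) straight down to (limit, 0) */
	return 0.5 * (limit - bdi_dirty) / (limit - knee);
}
```

In this model the factor on the auxiliary line never exceeds 0.5, which
matches the "1/4 to 1/2 of the write bandwidth" progress described above
for a bdi that starts out well above its setpoint.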

After 250 seconds, the main bdi control line takes over, as indicated by
the bdi_dirty fluctuating around the bdi setpoint and the position rate
(green line) fluctuating around the base ratelimit (blue line).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 20:24         ` Jan Kara
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 203+ messages in thread
From: Jan Kara @ 2011-08-17 20:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hi Fengguang,

On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > +					unsigned long thresh,
> > > +					unsigned long bg_thresh,
> > > +					unsigned long dirty,
> > > +					unsigned long bdi_thresh,
> > > +					unsigned long bdi_dirty)
> > > +{
> > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > +	unsigned long x_intercept;
> > > +	unsigned long setpoint;		/* the target balance point */
> > > +	unsigned long span;
> > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > +	long x;
> > > +
> > > +	if (unlikely(dirty >= limit))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * global setpoint
> > > +	 *
> > > +	 *                         setpoint - dirty 3
> > > +	 *        f(dirty) := 1 + (----------------)
> > > +	 *                         limit - setpoint
> > > +	 *
> > > +	 * it's a 3rd order polynomial that subjects to
> > > +	 *
> > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > +	 * (3) f(limit)    = 0   => the hard limit
> > > +	 * (4) df/dx       < 0	 => negative feedback control
                          ^^^ Strictly speaking this is <= 0

> > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > +	 */
> > > +	setpoint = (freerun + limit) / 2;
> > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > +		    limit - setpoint + 1);
> > > +	pos_ratio = x;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > +
> > > +	/*
> > > +	 * bdi setpoint
  OK, so if I understand the code right, we now have basic pos_ratio based
on global situation. Now, in the following code, we might scale pos_ratio
further down, if bdi_dirty is too much over bdi's share, right? Do we also
want to scale pos_ratio up, if we are under bdi's share? If yes, do we
really want to do it even if global pos_ratio < 1 (i.e. we are over global
setpoint)?

Maybe we could update the comment with something like:
 * We have computed basic pos_ratio above based on global situation. If the
 * bdi is over its share of dirty pages, we want to scale pos_ratio further
 * down. That is done by the following mechanism:
and now describe how updating works.

> > > +	 *
> > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
                  ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
bdi_setpoint to distinguish clearly from the global value.

> > > +	 *
> > > +	 * The main bdi control line is a linear function that subjects to
> > > +	 *
> > > +	 * (1) f(setpoint) = 1.0
> > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > +	 *
> > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > +	 * regularly within range
> > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > +	 * fluctuation range for pos_ratio.
> > > +	 *
> > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > +	 * own size, so move the slope over accordingly.
> > > +	 */
> > > +	if (unlikely(bdi_thresh > thresh))
> > > +		bdi_thresh = thresh;
> > > +	/*
> > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > +	 */
> > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > +	setpoint = setpoint * (u64)x >> 16;
> > > +	/*
> > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > +	 */
> > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > +		       thresh + 1);
> >   I think you can slightly simplify this to:
> > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> 
> Good idea!
> 
> > > +	x_intercept = setpoint + 2 * span;
   ^^ BTW, why do you have 2*span here? It can result in x_intercept being
~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
span?

> >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > easily 500 MB, that happens quite often I imagine?
> 
> That's fine because I no longer target "bdi_thresh" as some limiting
> factor as the global "thresh". Due to it being unstable in small
> memory JBOD systems, which is the big and unique problem in JBOD.
  I see. Given the control mechanism below, I think we can try this idea
and see whether it causes problems in practice or not. But the fact that
bdi_thresh is no longer treated as a limit should be noted in a changelog -
probably that of the last patch (although that is already too long for my
taste, so I'll look into how we could make it shorter so that the average
developer has enough patience to read it ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-18  4:18           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.
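
For reference, the global cubic can be tried out in user space with a
sketch like the one below. global_pos_ratio() is a hypothetical
stand-alone helper that takes freerun and limit directly; it uses
multiplication and division in place of the shifts so that negative
values stay well defined in portable C, which means rounding may differ
by one from the kernel's shifts.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10
#define ONE ((int64_t)1 << RATELIMIT_CALC_SHIFT)	/* fixed-point 1.0 */

/*
 * f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * in fixed point, with setpoint = (freerun + limit) / 2.
 */
static int64_t global_pos_ratio(unsigned long freerun, unsigned long limit,
				unsigned long dirty)
{
	int64_t setpoint = ((int64_t)freerun + (int64_t)limit) / 2;
	int64_t x, pos_ratio;

	if (dirty >= limit)
		return 0;

	/* x = (setpoint - dirty) / (limit - setpoint), scaled by ONE */
	x = (setpoint - (int64_t)dirty) * ONE / ((int64_t)limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / ONE;	/* x^2 */
	pos_ratio = pos_ratio * x / ONE;	/* x^3 */
	pos_ratio += ONE;			/* 1 + x^3 */

	return pos_ratio;
}
```

With freerun = 1000 and limit = 3000 this returns exactly 1024 (1.0) at
the setpoint, a value just under 2048 (2.0) at freerun, and a value close
to 0 one page below the limit, so a bdi factor moderately above 1 is
still multiplied down to almost nothing there.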

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to make it consistent
everywhere.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate by as much
as its own size, I guess the current slope of the control line is steep
enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means, when bdi_dirty moves away from the setpoint by bdi_thresh,
pos_ratio (and hence the task ratelimit) will change by -1/2. This is
probably already more than the users can tolerate?

btw. the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2

as shown in the graph in the updated patch below.

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = bdi_setpoint + 8 * write_bw
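
The arithmetic above can be checked with a small fixed-point sketch.
bdi_line_fixed() is a hypothetical helper written for illustration only;
it evaluates the single-bdi main control line through x_intercept and
returns pos_ratio scaled by 1 << RATELIMIT_CALC_SHIFT.

```c
#define RATELIMIT_CALC_SHIFT 10	/* 1.0 is represented as 1024 */

/*
 * pos_ratio = 1.0 + k * (dirty - bdi_setpoint), k = -1 / (8 * write_bw),
 * rewritten as (x_intercept - dirty) / (8 * write_bw).
 * Assumes dirty <= x_intercept and write_bw > 0 (units: pages, pages/s).
 */
static long bdi_line_fixed(long dirty, long bdi_setpoint, long write_bw)
{
	long x_intercept = bdi_setpoint + 8 * write_bw;

	return (x_intercept - dirty) * (1L << RATELIMIT_CALC_SHIFT) /
	       (8 * write_bw);
}
```

For example, with bdi_setpoint = 100000 and write_bw = 25600, the result
is 1088 at bdi_setpoint - write_bw/2 and 960 at bdi_setpoint + write_bw/2.
The swing of 128 out of 1024 is the 12.5% range targeted above, and the
result reaches 0 at x_intercept.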

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
is related to memory size due to the interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18  4:18           ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear while the
global pos_ratio scale quickly drops to 0 near @limit, thus
counteracting any bdi pos_ratio > 1.
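
To illustrate how quickly the global factor falls off near @limit, here
is a minimal userspace sketch of the global control line alone. It is
not the kernel code: the function name global_pos_ratio() and the ONE
macro are made up for this example, and it uses signed 64-bit
multiply/divide instead of shifts so that negative intermediates stay
well defined in portable C.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10
#define ONE	(1LL << RATELIMIT_CALC_SHIFT)	/* 1.0 in fixed point */

/*
 * Hypothetical userspace model of the global control line:
 *
 *     f(dirty) = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
 *
 * Returns the factor in fixed point (ONE == 1.0).
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, pos_ratio;

	assert(freerun >= 0 && limit > freerun);

	if (dirty >= limit)
		return 0;

	/* x = (setpoint - dirty) / (limit - setpoint), in fixed point */
	x = (setpoint - dirty) * ONE / (limit - setpoint + 1);

	/* cube x, rescaling after each multiplication */
	pos_ratio = x;
	pos_ratio = pos_ratio * x / ONE;
	pos_ratio = pos_ratio * x / ONE;

	return pos_ratio + ONE;
}
```

With freerun = 1000 and limit = 3000 pages, the result is exactly ONE at
the setpoint (2000 pages), close to 2 * ONE at freerun, and only a small
fraction of ONE one page below the limit, so whatever the bdi line
multiplies in is applied to a value that is already near zero there.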

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to keep the naming
consistent throughout.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate by as much
as its own size, I guess the current slope of the control line is
steep enough.

Given the equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means that when bdi_dirty changes by bdi_thresh, pos_ratio and
hence the task ratelimit will change by -1/2. This is probably already
more than users can tolerate?

btw, the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2

as shown in the graph in the updated patch below.
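
The arithmetic above can be checked with a small real-valued model of
the main bdi control line. This is only a sketch for illustration: the
helper names bdi_main_line() and bdi_connect_point() do not exist in the
patch, and doubles are used instead of the kernel's fixed-point math.

```c
#include <assert.h>

/*
 * Main bdi control line as a real-valued function:
 *
 *     f(bdi_dirty) = (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint)
 *
 * with x_intercept = bdi_setpoint + 2 * span, i.e. slope k = -0.5 / span.
 */
static double bdi_main_line(double bdi_setpoint, double span, double bdi_dirty)
{
	double x_intercept = bdi_setpoint + 2.0 * span;

	assert(span > 0.0);
	return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
}

/* Point where the main line hands over to the auxiliary line. */
static double bdi_connect_point(double bdi_setpoint, double span)
{
	double x_intercept = bdi_setpoint + 2.0 * span;

	return (x_intercept + bdi_setpoint) / 2.0;	/* == bdi_setpoint + span */
}
```

For example, with span = bdi_thresh = 800 pages, moving bdi_dirty from
1000 to 1800 pages (dx = bdi_thresh) changes the value of
bdi_main_line() by -1/2, and the connect point is bdi_setpoint + span,
where the line evaluates to 1/2.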

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1 second's worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
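
As a worked check of the 12.5% figure, the following userspace sketch
evaluates the single-bdi control line in fixed point. The function name
single_bdi_pos_ratio() is an assumption of this example, and unlike the
patch it divides by the exact denominator (no "+ 1"), which is safe here
because write_bw is asserted to be positive.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * Single-bdi control line with x_intercept = bdi_setpoint + 8 * write_bw:
 *
 *     pos_ratio = (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint)
 *
 * Result is in fixed point, 1.0 == 1 << RATELIMIT_CALC_SHIFT.
 */
static int64_t single_bdi_pos_ratio(int64_t bdi_setpoint, int64_t write_bw,
				    int64_t bdi_dirty)
{
	int64_t x_intercept = bdi_setpoint + 8 * write_bw;

	assert(write_bw > 0);
	return (x_intercept - bdi_dirty) * (1LL << RATELIMIT_CALC_SHIFT) /
	       (x_intercept - bdi_setpoint);
}
```

Taking bdi_setpoint = 100000 pages and write_bw = 25600 pages/s as
example values, pos_ratio is 1088 at bdi_setpoint - write_bw/2 and 960
at bdi_setpoint + write_bw/2. The difference of 128 is 1/8 of 1024, that
is, the targeted 12.5% fluctuation range.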

The global/bdi slopes complement each other nicely when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
is related to memory size due to the interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
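
The weighted sum can be sketched in userspace as below. The function
name bdi_span() is made up for this example and write_bw stands in for
bdi->avg_write_bandwidth; the 16-bit fraction follows the computation in
the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * span = bdi_thresh/thresh * (4 * write_bw)
 *      + (thresh - bdi_thresh)/thresh * bdi_thresh
 *
 * computed with x = bdi_thresh / thresh held as a 16-bit fraction.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh, uint64_t write_bw)
{
	uint64_t x;

	/* keep the << 16 below far away from 64-bit overflow */
	assert(thresh < (1ULL << 40) && write_bw < (1ULL << 40));

	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	/* "+ 1" avoids a division by zero when thresh == 0 */
	x = (bdi_thresh << 16) / (thresh + 1);

	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

With thresh = 100000 and write_bw = 25600 pages/s, a single bdi
(bdi_thresh == thresh) gets a span just under 4 * write_bw = 102400
pages, while a bdi holding a tenth of the threshold gets roughly
0.1 * 102400 + 0.9 * 10000 = 19240 pages.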

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18  4:41             ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Hi Jan,

> > > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > > easily 500 MB, that happens quite often I imagine?
> > > 
> > > That's fine because I no longer target "bdi_thresh" as some limiting
> > > factor as the global "thresh". Due to it being unstable in small
> > > memory JBOD systems, which is the big and unique problem in JBOD.
> >   I see. Given the control mechanism below, I think we can try this idea
> > and see whether it makes problems in practice or not. But the fact that
> > bdi_thresh is no longer treated as limit should be noted in a changelog -
> > probably of the last patch (although that is already too long for my taste
> > so I'll look into how we could make it shorter so that average developer
> > has enough patience to read it ;).
> 
> Good point. I'll make it a comment in the last patch.

Just added this comment:

+               /*
+                * bdi_thresh is not treated as a limiting factor the way
+                * dirty_thresh is, for two reasons:
+                * - in a JBOD setup, bdi_thresh can fluctuate a lot
+                * - in a system with an HDD and a USB key, the USB key may
+                *   get into the state (bdi_dirty >> bdi_thresh), either because
+                *   bdi_dirty starts high or because bdi_thresh drops low.
+                *   In this case we don't want to hard throttle the USB key
+                *   dirtiers for 100 seconds until bdi_dirty drops under
+                *   bdi_thresh. Instead, the auxiliary bdi control line in
+                *   bdi_position_ratio() will let the dirtier task progress
+                *   at some rate <= (write_bw / 2) to bring bdi_dirty down.
+                */
                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
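
A minimal userspace sketch of the bdi-side scaling with the auxiliary line,
modeled on the tail of bdi_position_ratio() in this patch. The function name
bdi_scale_with_aux() and the sample numbers below are illustrative
assumptions, not kernel code, and plain 64-bit division stands in for
do_div().

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10

/*
 * Hypothetical helper: bdi-side scale factor in fixed point
 * (1 << RATELIMIT_CALC_SHIFT means 1.0). All arguments are page counts.
 */
static uint64_t bdi_scale_with_aux(uint64_t bdi_setpoint, uint64_t span,
				   uint64_t limit, uint64_t bdi_dirty)
{
	uint64_t x_intercept = bdi_setpoint + 2 * span;
	uint64_t scale = 1ULL << RATELIMIT_CALC_SHIFT;

	if (bdi_dirty > bdi_setpoint + span) {
		if (bdi_dirty > limit)
			return 0;
		if (x_intercept < limit) {
			/* switch to the auxiliary line: 1/2 at the connect point, 0 at limit */
			x_intercept = limit;
			bdi_setpoint += span;
			scale >>= 1;
		}
	}
	/* the +1 mirrors the patch and avoids a zero divisor when span == 0 */
	return scale * (x_intercept - bdi_dirty) /
	       (x_intercept - bdi_setpoint + 1);
}
```

With bdi_setpoint = 1000, span = 500 and limit = 4000, the main line alone
would reach zero at 2000 pages, whereas this version still returns a small
positive scale at 2750 pages and only reaches zero at the limit. On the
auxiliary line the scale never exceeds one half.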

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18 19:16             ` Jan Kara
  -1 siblings, 0 replies; 203+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...
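
The point about the global scale dropping to 0 near @limit can be checked
with a small userspace sketch of the global control line on its own. The
name global_pos_ratio() and the ONE macro are assumptions made for this
example; division by ONE stands in for the kernel's right shift, so negative
intermediates round toward zero here and results may differ from the kernel
by one unit.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10
#define ONE ((int64_t)1 << RATELIMIT_CALC_SHIFT)	/* fixed-point 1.0 */

/*
 * Hypothetical helper for
 *   f(dirty) = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
 * with setpoint = (freerun + limit) / 2. Arguments are page counts.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint, x, pos_ratio;

	assert(0 <= freerun && freerun < limit);
	if (dirty >= limit)
		return 0;	/* at or over the hard limit */

	setpoint = (freerun + limit) / 2;
	x = (setpoint - dirty) * ONE / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / ONE;	/* x^2 */
	pos_ratio = pos_ratio * x / ONE;	/* x^3 */
	return pos_ratio + ONE;
}
```

With freerun = 1000 and limit = 3000, the result is 1024 (1.0) at the
setpoint of 2000 pages, close to 2048 (2.0) at freerun, and under 2% of 1.0
one page below the limit, so even a bdi factor of 2 leaves the product close
to zero there.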

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> everywhere.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that is subject to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) yields a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that in some configurations bdi_thresh can fluctuate by up to its
> own size, I guess the current slope of the control line is steep enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty deviates by bdi_thresh, pos_ratio and
> hence the task ratelimit will change by -1/2. This is probably
> already more than users can tolerate?
  OK, let's try that.
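
The arithmetic quoted above can be reproduced with a short sketch of the
main bdi control line. bdi_linear_scale() is a hypothetical name; the sketch
leaves out the "+ 1" in the divisor and the auxiliary line that the patch
applies, so it only illustrates the slope k = -0.5 / span.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10

/*
 * Hypothetical helper: (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint)
 * in fixed point, with x_intercept = bdi_setpoint + 2 * span.
 */
static uint64_t bdi_linear_scale(uint64_t bdi_setpoint, uint64_t span,
				 uint64_t bdi_dirty)
{
	uint64_t x_intercept = bdi_setpoint + 2 * span;

	assert(span > 0);
	if (bdi_dirty >= x_intercept)
		return 0;	/* at or past the x intercept */
	return ((x_intercept - bdi_dirty) << RATELIMIT_CALC_SHIFT) / (2 * span);
}
```

Moving bdi_dirty one span above bdi_setpoint halves the scale (1024 -> 512),
and moving it one span below raises the scale to 1.5 (1536), which is the
df = -1/2 per span worked out in the quoted text.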

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within one second's worth of data, the bdi control line's slope is
> selected to be a linear function of the bdi write bandwidth, so that it
> can adapt well to slow/fast storage devices.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If we target a 12.5% fluctuation range in pos_ratio when dirty pages
> fluctuate in the range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes complement each other nicely when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
> is related to memory size due to interference between disks.  In this
> case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza
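
The "weighted sum of write_bw and bdi_thresh" from the changelog can be
sketched in userspace using the simplified span expression adopted earlier
in this thread. bdi_span() is a hypothetical name, the inputs are assumed to
be page counts small enough that the 16-bit shift cannot overflow, and the
figures in the note below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper:
 *   span = bdi_thresh/thresh * (4 * write_bw)
 *        + (thresh - bdi_thresh)/thresh * bdi_thresh
 * computed with a 16-bit fixed-point ratio, as in the patch.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	uint64_t x;

	if (bdi_thresh > thresh)
		bdi_thresh = thresh;
	/* x = bdi_thresh / thresh in 16-bit fixed point; +1 avoids a zero divisor */
	x = (bdi_thresh << 16) / (thresh + 1);
	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

For a single bdi (bdi_thresh == thresh) the result is close to 4 * write_bw,
so x_intercept = bdi_setpoint + 2 * span matches the changelog's
bdi_setpoint + 8 * write_bw. As bdi_thresh shrinks relative to thresh, the
result moves toward bdi_thresh: with thresh = 100000, bdi_thresh = 10000 and
write_bw = 1000 it lands just under 9400 pages.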

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages to be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * It's a 3rd order polynomial that is subject to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => ramp up base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that yields a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that is subject to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield in a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
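
For readers who want to experiment with the global control line outside
the kernel, here is a minimal user-space sketch of the cubic. It is not
the patch's code: the name global_pos_ratio() and the ONE macro are
hypothetical, division by ONE stands in for the arithmetic right shift
(so negative intermediates round toward zero rather than down), and the
"+ 1" divisor guard is replaced by an assertion.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10
#define ONE	((int64_t)1 << RATELIMIT_CALC_SHIFT)	/* 1.0 in fixed point */

/*
 * Hypothetical helper:
 *
 *   f(dirty) = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
 *
 * Returns pos_ratio scaled by ONE.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, ratio;

	assert(limit > freerun);	/* guarantees limit - setpoint > 0 */
	if (dirty >= limit)
		return 0;

	x = (setpoint - dirty) * ONE / (limit - setpoint);
	ratio = x;
	ratio = ratio * x / ONE;	/* x^2 */
	ratio = ratio * x / ONE;	/* x^3 */
	return ratio + ONE;
}
```

With freerun = 1000 and limit = 3000 the setpoint is 2000. The sketch then
returns 2048 (2.0) at freerun, 1024 (1.0) at the setpoint and 0 at the
limit. Halfway between setpoint and limit it returns 896 (0.875), which
shows how flat the curve stays near the setpoint.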
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:06     ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:11AM +0800, Wu Fengguang wrote:

[..]
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>  				     nr_dirty, bdi_thresh, bdi_dirty,
>  				     start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_rate = bdi->dirty_ratelimit;
> +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> +					       background_thresh, nr_dirty,
> +					       bdi_thresh, bdi_dirty);
> +		if (unlikely(pos_ratio == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		task_ratelimit = (u64)base_rate *
> +					pos_ratio >> RATELIMIT_CALC_SHIFT;

Hi Fengguang,

I am a little confused here. I see that you have already taken pos_ratio
into account in bdi_update_dirty_ratelimit(), and I am wondering why we
take it into account again in balance_dirty_pages().

We calculated the pos_rate and balanced_rate and adjusted
bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

So why are we adjusting this pos_ratio()-adjusted limit again with
pos_ratio()? Doesn't it effectively become the following (assuming
one is decreasing the dirty rate limit)?

base_rate = bdi->dirty_ratelimit
pos_rate = base_rate * pos_ratio();

			  write_bw
balance_rate = pos_rate * --------
			  dirty_bw

delta = max(pos_rate, balance_rate)
bdi->dirty_ratelimit = bdi->dirty_ratelimit - delta;

task_ratelimit = bdi->dirty_ratelimit * pos_ratio().

So we have already taken pos_ratio() into account while calculating the
new bdi->dirty_ratelimit. Do we need to take it into account again?
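
To make the scaling step under discussion concrete, the quoted
task_ratelimit expression can be reproduced in user space roughly as
below. This is only a sketch: the function name task_ratelimit_sketch()
is hypothetical, and the overflow assertion is an addition that the
quoted hunk does not contain.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * Hypothetical helper: scale base_rate (pages/s) by pos_ratio, where
 * pos_ratio is a fixed-point value with 1 << RATELIMIT_CALC_SHIFT
 * meaning 1.0.
 */
static uint64_t task_ratelimit_sketch(uint64_t base_rate, uint64_t pos_ratio)
{
	/* guard against overflow of the 64-bit product (sketch only) */
	assert(pos_ratio == 0 || base_rate <= UINT64_MAX / pos_ratio);

	return (base_rate * pos_ratio) >> RATELIMIT_CALC_SHIFT;
}
```

A pos_ratio of 1024 leaves the rate unchanged, 512 halves it and 2048
doubles it.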

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:53     ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

A few very basic queries.

- Why can't we use the same formula for the bdi position ratio as for the
  global position ratio? Are you not looking for similar properties,
  i.e. little variation near the setpoint and faster throttling away
  from the setpoint?

- In the bdi calculation, setpoint seems to be in number of pages, and
  the limit (x_intercept) seems to be a combination of nr pages +
  pages/sec. Why is it different from the global setpoint and limit? I
  mean, could this not have been like the global calculation, where we
  try to keep bdi_dirty close to bdi_thresh and calculate pos_ratio?

- In the global pos_ratio calculation the terminology used is "limit",
  while the same thing seems to be called x_intercept in the bdi
  position ratio calculation.

Am I missing something very basic here?
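
To make the comparison concrete, the main bdi control line for the
single-bdi case described in the quoted comments can be sketched in user
space as below. This is not the patch's code: the name
bdi_linear_ratio() is hypothetical, the auxiliary line and the JBOD
weighting are left out, and treating write_bw (pages written per second)
as a plain page count is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * Hypothetical helper for the single-bdi case:
 *
 *                   x_intercept - bdi_dirty
 *   f(bdi_dirty) = --------------------------
 *                  x_intercept - bdi_setpoint
 *
 * with x_intercept = bdi_setpoint + 8 * write_bw. The result is scaled
 * so that 1 << RATELIMIT_CALC_SHIFT means 1.0.
 */
static int64_t bdi_linear_ratio(int64_t bdi_setpoint, int64_t write_bw,
				int64_t bdi_dirty)
{
	int64_t x_intercept = bdi_setpoint + 8 * write_bw;

	assert(write_bw > 0);
	if (bdi_dirty >= x_intercept)
		return 0;

	return ((x_intercept - bdi_dirty) << RATELIMIT_CALC_SHIFT) /
		(x_intercept - bdi_setpoint);
}
```

With bdi_setpoint = 10000 and write_bw = 1600, the sketch returns 1088 at
bdi_setpoint - write_bw/2 and 960 at bdi_setpoint + write_bw/2. That is
a swing of 128 out of 1024, or the 12.5% mentioned in the comments.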

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-19  2:53     ` Vivek Goyal
  0 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

A few very basic queries.

- Why can't we use the same formula for the bdi position ratio as for the
  global position ratio? Are you not looking for similar properties: near
  the setpoint the variation is small, and away from the setpoint the
  throttling responds faster?

- In the bdi calculation, setpoint seems to be in number of pages, and
  limit (x_intercept) seems to be a combination of nr pages + pages/sec.
  Why is it different from the global setpoint and limit? I mean, could
  this not have been like the global calculation, where we try to keep
  bdi_dirty close to bdi_thresh and calculate pos_ratio?

- In the global pos_ratio calculation the terminology used is "limit",
  while the same thing seems to be mentioned as x_intercept in the bdi
  position ratio calculation.

Am I missing something very basic here?

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:06     ` Vivek Goyal
@ 2011-08-19  2:54       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-19  2:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

> > +		base_rate = bdi->dirty_ratelimit;
> > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > +					       background_thresh, nr_dirty,
> > +					       bdi_thresh, bdi_dirty);
> > +		if (unlikely(pos_ratio == 0)) {
> > +			pause = MAX_PAUSE;
> > +			goto pause;
> >  		}
> > +		task_ratelimit = (u64)base_rate *
> > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> 
> Hi Fengguang,
> 
> I am a little confused here. I see that you have already taken pos_ratio
> into account in bdi_update_dirty_ratelimit() and am wondering why it is
> taken into account again in balance_dirty_pages().
> 
> We calculated the pos_rate and balanced_rate and adjusted the
> bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

Good question. There are some inter-dependencies in the calculation,
and the dependency chain is the opposite of the one you have in mind:
balance_dirty_pages() uses pos_ratio in the first place, so
bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
of the balanced dirty rate, too.

Let's return to how the balanced dirty rate is estimated. Please pay
special attention to the last paragraphs below the "......" line.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                               (1)
                         (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

        dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0                              (2)
Or     
        task_ratelimit_0 = dirty_rate / N                               (3)

Now we conclude that the balanced task ratelimit can be estimated by

        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)

Because with (2) and (3), (4) yields the desired equality (1):

        balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
                      == write_bw / N

.............................................................................

Now let's revisit (1). Since balance_dirty_pages() chooses to execute
the ratelimit

        task_ratelimit = task_ratelimit_0
                       = dirty_ratelimit * pos_ratio                    (5)

Putting (5) into (4), we get the final form used in
bdi_update_dirty_ratelimit()

        balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)

So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.
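
To make the single-entity point concrete, here is a minimal userspace
sketch (not the kernel code). It assumes a fixed-point shift of 10 for
RATELIMIT_CALC_SHIFT, and naive_rate() is a hypothetical variant that
leaves pos_ratio out, introduced only for comparison.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed fixed-point shift: 1.0 == 1 << 10 */

/* (5): the rate each task is actually throttled at. */
static uint64_t task_ratelimit(uint64_t dirty_ratelimit, uint64_t pos_ratio)
{
	return dirty_ratelimit * pos_ratio >> CALC_SHIFT;
}

/* (4) with (5) substituted: executed rate scaled by write_bw / dirty_rate. */
static uint64_t balanced_rate(uint64_t dirty_ratelimit, uint64_t pos_ratio,
			      uint64_t write_bw, uint64_t dirty_rate)
{
	assert(dirty_rate > 0);
	return task_ratelimit(dirty_ratelimit, pos_ratio) * write_bw / dirty_rate;
}

/* Hypothetical variant that ignores pos_ratio, for comparison only. */
static uint64_t naive_rate(uint64_t dirty_ratelimit,
			   uint64_t write_bw, uint64_t dirty_rate)
{
	assert(dirty_rate > 0);
	return dirty_ratelimit * write_bw / dirty_rate;
}
```

With N = 4 dd tasks, dirty_ratelimit = 1000 pages/s, pos_ratio = 2.0 and
write_bw = 10000 pages/s, each task runs at 2000 pages/s, so the measured
dirty_rate is 8000 pages/s. balanced_rate() then gives 2500 == write_bw / N,
while naive_rate() gives 1250, which is off by exactly the factor pos_ratio.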

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread


* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-19  2:53     ` Vivek Goyal
@ 2011-08-19  3:25       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-19  3:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:53:21AM +0800, Vivek Goyal wrote:
> On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:
> 
> [..]
> > +/*
> > + * Dirty position control.
> > + *
> > + * (o) global/bdi setpoints
> > + *
> > + * We want the dirty pages be balanced around the global/bdi setpoints.
> > + * When the number of dirty pages is higher/lower than the setpoint, the
> > + * dirty position control ratio (and hence task dirty ratelimit) will be
> > + * decreased/increased to bring the dirty pages back to the setpoint.
> > + *
> > + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> > + *
> > + *     if (dirty < setpoint) scale up   pos_ratio
> > + *     if (dirty > setpoint) scale down pos_ratio
> > + *
> > + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> > + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> > + *
> > + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> > + *
> > + * (o) global control line
> > + *
> > + *     ^ pos_ratio
> > + *     |
> > + *     |            |<===== global dirty control scope ======>|
> > + * 2.0 .............*
> > + *     |            .*
> > + *     |            . *
> > + *     |            .   *
> > + *     |            .     *
> > + *     |            .        *
> > + *     |            .            *
> > + * 1.0 ................................*
> > + *     |            .                  .     *
> > + *     |            .                  .          *
> > + *     |            .                  .              *
> > + *     |            .                  .                 *
> > + *     |            .                  .                    *
> > + *   0 +------------.------------------.----------------------*------------->
> > + *           freerun^          setpoint^                 limit^   dirty pages
> > + *
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0                 setpoint                     x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> > + */
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
> > +	x_intercept = setpoint + 2 * span;
> > +
> 
> Hi Fengguang,
> 
> A few very basic queries.
> 
> - Why can't we use the same formula for the bdi position ratio as for the
>   global position ratio? Are you not looking for similar properties: near
>   the setpoint the variation is small, and away from the setpoint the
>   throttling responds faster?

The changelog has more details, but I hope this rephrased summary
answers the question better.

Firstly, for the single bdi case, the different bdi/global formulas
complement each other: the bdi's slope is proportional to the
writeout bandwidth, while the global one scales with memory size.
On a huge memory system, the global position feedback becomes very weak
(even far away from the setpoint).  This is where the bdi control line
can help pull the dirty pages to the setpoint.

Secondly, for the JBOD case, the global/bdi dirty thresholds are
fundamentally different. The global one is a stable and strong limit,
while the bdi one fluctuates and hence is only suitable as a
weak limit. The other reason to make it a weak limit is that there are
valid situations where (bdi_dirty >> bdi_thresh), and it's desirable to
throttle the dirtier at a reasonably small rate rather than to hard
throttle it.
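
As a rough illustration of how weak the global feedback gets, the global
control line quoted above can be replayed in userspace. This is only a
sketch: it assumes RATELIMIT_CALC_SHIFT is 10, and it uses multiply/divide
instead of shifts on signed values, so rounding of negative terms may
differ by one unit from the patch.

```c
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed value of RATELIMIT_CALC_SHIFT */

/*
 * f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3 in fixed point.
 * The caller must ensure dirty < limit, mirroring the early return in
 * bdi_position_ratio().
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t one = (int64_t)1 << CALC_SHIFT;
	int64_t x = (setpoint - dirty) * one / (limit - setpoint + 1);
	int64_t pos_ratio = x;

	pos_ratio = pos_ratio * x / one;	/* x^2 */
	pos_ratio = pos_ratio * x / one;	/* x^3 */
	return pos_ratio + one;			/* 1 + x^3 */
}
```

With freerun = 1000 and limit = 3000 pages, the sketch returns 2042 (about
2.0) at dirty = 1000, 1024 (1.0) at the setpoint and 9 (about 0) at
dirty = 2999. With freerun = 1000000 and limit = 3000000 pages, being
100000 pages above the setpoint still returns 1024, i.e. no measurable
correction at all.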

> - In the bdi calculation, setpoint seems to be in number of pages, and
>   limit (x_intercept) seems to be a combination of nr pages + pages/sec.
>   Why is it different from the global setpoint and limit? I mean, could
>   this not have been like the global calculation, where we try to keep
>   bdi_dirty close to bdi_thresh and calculate pos_ratio?

Because the bdi dirty pages are observed to typically fluctuate by up to
1 second's worth of data. So the write_bw used here is really
(1s * write_bw), which is a number of pages.
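
To illustrate the units, the span and x_intercept computation quoted
above can be replayed in userspace with everything expressed in pages.
The helper names bdi_span() and bdi_x_intercept() are made up for this
sketch, and the numbers in the note below are illustrative, not
measurements.

```c
#include <stdint.h>

/*
 * All quantities are in pages. write_bw stands for the pages written in
 * one second (1s * write_bw), which is why it can be added to page counts.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	if (bdi_thresh > thresh)
		bdi_thresh = thresh;
	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}

/* Where the main bdi control line would cross pos_ratio == 0. */
static uint64_t bdi_x_intercept(uint64_t bdi_setpoint, uint64_t span)
{
	return bdi_setpoint + 2 * span;
}
```

For a single bdi (bdi_thresh == thresh == 100000 pages, write_bw = 25000
pages), the span comes out as 99999, i.e. roughly 4 * write_bw, so
x_intercept is about setpoint + 8 * write_bw. With bdi_thresh = 10000
instead, the span drops to 18999 and is dominated by bdi_thresh.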

> - In the global pos_ratio calculation the terminology used is "limit",
>   while the same thing seems to be mentioned as x_intercept in the bdi
>   position ratio calculation.

Yes. Because the bdi control lines don't intend to enforce a hard limit
at all.

It's actually possible for x_intercept to become larger than the global limit.
This means it's a memory-tight system (or the storage is super fast)
where the bdi dirty pages will inevitably fluctuate a lot (up to write_bw).
We just let go of them and let the global formula take control.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread


* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:54       ` Wu Fengguang
@ 2011-08-19 19:00         ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-19 19:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> Hi Vivek,
> 
> > > +		base_rate = bdi->dirty_ratelimit;
> > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > +					       background_thresh, nr_dirty,
> > > +					       bdi_thresh, bdi_dirty);
> > > +		if (unlikely(pos_ratio == 0)) {
> > > +			pause = MAX_PAUSE;
> > > +			goto pause;
> > >  		}
> > > +		task_ratelimit = (u64)base_rate *
> > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > 
> > Hi Fengguang,
> > 
> > I am a little confused here. I see that you have already taken pos_ratio
> > into account in bdi_update_dirty_ratelimit() and am wondering why it is
> > taken into account again in balance_dirty_pages().
> > 
> > We calculated the pos_rate and balanced_rate and adjusted the
> > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> 
> Good question. There are some inter-dependencies in the calculation,
> and the dependency chain is the opposite of the one you have in mind:
> balance_dirty_pages() uses pos_ratio in the first place, so
> bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> of the balanced dirty rate, too.
> 
> Let's return to how the balanced dirty rate is estimated. Please pay
> special attention to the last paragraphs below the "......" line.
> 
> Start by throttling each dd task at rate
> 
>         task_ratelimit = task_ratelimit_0                               (1)
>                          (any non-zero initial value is OK)
> 
> After 200ms, we measured
> 
>         dirty_rate = # of pages dirtied by all dd's / 200ms
>         write_bw   = # of pages written to the disk / 200ms
> 
> For the aggressive dd dirtiers, the equality holds
> 
>         dirty_rate == N * task_rate
>                    == N * task_ratelimit
>                    == N * task_ratelimit_0                              (2)
> Or     
>         task_ratelimit_0 = dirty_rate / N                               (3)
> 
> Now we conclude that the balanced task ratelimit can be estimated by
> 
>         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> 
> Because with (2) and (3), (4) yields the desired equality (1):
> 
>         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
>                       == write_bw / N

Hi Fengguang,

The following is my understanding. Please correct me where I have got it
wrong.

OK, I think I follow up to this point. I think what you are saying is
that the following is our goal in a stable system:

	task_ratelimit = write_bw / N					(6)

So we measure the write_bw of a bdi over a period of time and use it
as a feedback loop to modify bdi->dirty_ratelimit, which in turn modifies
task_ratelimit, and hence we achieve the balance. So we start with
some arbitrary task limit, say task_ratelimit_0, and modify that limit
over a period of time based on our feedback loop to achieve a balanced
system. And the following seems to be the formula:

	task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(7)

Now I also understand that, by using (2) and (3), you proved that
(7) leads to (6), which is our desired goal.

> 
> .............................................................................
> 
> Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> the ratelimit
> 
>         task_ratelimit = task_ratelimit_0
>                        = dirty_ratelimit * pos_ratio                    (5)
> 

So balance_dirty_pages() chose to take pos_ratio into account as well,
because, for various reasons, taking only the bandwidth variation as
feedback was not sufficient. So we also took pos_ratio into account,
which in turn depends on the global dirty pages and the per-bdi dirty
pages/rate.

So we refined the formula for calculating a task's effective rate
over a period of time to the following:

	task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate) * pos_ratio	(9)

Is my understanding right so far?

> Putting (5) into (4), we get the final form used in
> bdi_update_dirty_ratelimit()
> 
>         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> 
> So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Now few questions.

- What is dirty_ratelimit in formula above?

- Is it wrong to understand the issue in following manner.

  bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
  and effectively tracks write_bw/N.

  bdi->dirty_ratelimit = write_bw/N

  or 

					    		  write_bw
  bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
					     		  dirty_rate

 Hence a task's balanced rate from (9) and (10) is:

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)

So is my understanding of (10) and (11) wrong? If not, then the question
is this: bdi->dirty_ratelimit is supposed to keep track of
write bandwidth variations only, and in turn the task ratelimit will be
driven by both bandwidth variation and pos_ratio variation.

But you seem to be doing the following.

 bdi->dirty_ratelimit = adjust based on a combination of bandwidth feedback
		        and pos_ratio feedback.

 task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)

So my question is: when task_ratelimit is finally being adjusted
based on pos_ratio feedback, why does bdi->dirty_ratelimit also need to
take that into account?

I know you have tried explaining it, but sorry, I did not get it. Maybe
give it another shot in layman's terms and I might understand it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19 19:00         ` Vivek Goyal
@ 2011-08-21  3:46           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-21  3:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > Hi Vivek,
> > 
> > > > +		base_rate = bdi->dirty_ratelimit;
> > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > +					       background_thresh, nr_dirty,
> > > > +					       bdi_thresh, bdi_dirty);
> > > > +		if (unlikely(pos_ratio == 0)) {
> > > > +			pause = MAX_PAUSE;
> > > > +			goto pause;
> > > >  		}
> > > > +		task_ratelimit = (u64)base_rate *
> > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > 
> > > > Hi Fengguang,
> > > > 
> > > > I am a little confused here. I see that you have already taken pos_ratio
> > > > into account in bdi_update_dirty_ratelimit() and am wondering why you take
> > > > that into account again in balance_dirty_pages().
> > > 
> > > We calculated the pos_rate and balanced_rate and adjusted the
> > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > 
> > Good question. There are some inter-dependencies in the calculation,
> > and the dependency chain is the opposite to the one in your mind:
> > balance_dirty_pages() used pos_ratio in the first place, so
> > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > of the balanced dirty rate, too.
> > 
> > Let's return to how the balanced dirty rate is estimated. Please pay
> > special attention to the last paragraphs below the "......" line.
> > 
> > Start by throttling each dd task at rate
> > 
> >         task_ratelimit = task_ratelimit_0                               (1)
> >                          (any non-zero initial value is OK)
> > 
> > After 200ms, we measured
> > 
> >         dirty_rate = # of pages dirtied by all dd's / 200ms
> >         write_bw   = # of pages written to the disk / 200ms
> > 
> > For the aggressive dd dirtiers, the equality holds
> > 
> >         dirty_rate == N * task_rate
> >                    == N * task_ratelimit
> >                    == N * task_ratelimit_0                              (2)
> > Or     
> >         task_ratelimit_0 = dirty_rate / N                               (3)
> > 
> > Now we conclude that the balanced task ratelimit can be estimated by
> > 
> >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > 
> > Because with (2) and (3), (4) yields the desired equality (1):
> > 
> >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> >                       == write_bw / N
> 
> Hi Fengguang,
> 
> Following is my understanding. Please correct me where I got it wrong.
> 
> OK, I think I follow up to this point. I think what you are saying is
> that the following is our goal in a stable system.
> 
> 	task_ratelimit = write_bw/N				(6)
> 
> So we measure the write_bw of a bdi over a period of time and use that
> as a feedback loop to modify bdi->dirty_ratelimit, which in turn modifies
> task_ratelimit, and hence we achieve the balance. So we will start with
> some arbitrary task limit, say task_ratelimit_0, and modify that limit
> over a period of time based on our feedback loop to achieve a balanced
> system. The following seems to be the formula.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> 					    dirty_rate
> 
> Now I also understand that by using (2) and (3), you proved that
> (7) leads to (6), which is our desired goal.

That's right.

> > 
> > .............................................................................
> > 
> > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > the ratelimit
> > 
> >         task_ratelimit = task_ratelimit_0
> >                        = dirty_ratelimit * pos_ratio                    (5)
> > 
> 
> So balance_dirty_pages() chose to take pos_ratio into account as well,
> because for various reasons bandwidth variation alone was not sufficient
> as feedback. So we also took pos_ratio into account, which in turn
> depends on global dirty pages and per-bdi dirty pages/rate.

That's right so far. balance_dirty_pages() needs to do dirty position
control, so it uses formula (5).

> So we refined the formula for calculating a task's effective rate
> over a period of time to the following.
> 					    write_bw
> 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> 					    dirty_rate
> 

That's not true. It should still be formula (7) when
balance_dirty_pages() considers pos_ratio.

> > Put (5) into (4), we get the final form used in
> > bdi_update_dirty_ratelimit()
> > 
> >         balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> > 
> > So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.
> 
> Now a few questions.
> 
> - What is dirty_ratelimit in the formula above?

It's bdi->dirty_ratelimit.

> - Is it wrong to understand the issue in the following manner?
> 
>   bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
>   and effectively tracks write_bw/N.
> 
>   bdi->dirty_ratelimit = write_bw/N

Yes. Strictly speaking, the target value is (note the "==")

        bdi->dirty_ratelimit == write_bw/N

>   or 
> 
> 					    		  write_bw
>   bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * -------------    (10)
> 					     		  dirty_rate

Neither (9) nor (10) is true. The right form is

                                                                     write_bw
balanced_rate = whatever_ratelimit_executed_in_balance_dirty_pages * ----------
                                                                     dirty_rate

where

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio
bdi->dirty_ratelimit ~= balanced_rate
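
The difference between (10) and the form above can be checked with a small
sketch. It assumes N aggressive dirtiers that each run at the executed rate
limit from (11); the function names are hypothetical and not part of the
patch.

```c
/* Rate limit actually executed by balance_dirty_pages(), formula (11). */
static double executed_ratelimit(double dirty_ratelimit, double pos_ratio)
{
	return dirty_ratelimit * pos_ratio;
}

/* Form (10): scales bdi->dirty_ratelimit itself, ignoring pos_ratio. */
static double estimate_form_10(double dirty_ratelimit,
			       double write_bw, double dirty_rate)
{
	return dirty_ratelimit * write_bw / dirty_rate;
}

/* The form above: scales the rate limit that was actually executed. */
static double estimate_balanced_rate(double dirty_ratelimit, double pos_ratio,
				     double write_bw, double dirty_rate)
{
	return executed_ratelimit(dirty_ratelimit, pos_ratio) *
	       write_bw / dirty_rate;
}
```

With N = 4, write_bw = 8000, bdi->dirty_ratelimit = 1000 and pos_ratio = 0.5,
the tasks are throttled at 500 each, so dirty_rate = 2000. Form (10) then
yields 4000, which is write_bw / (N * pos_ratio), while the executed-rate
form yields 2000 = write_bw / N.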

>  Hence a task's balanced rate from (9) and (10) is:
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(11)
> So is my understanding of (10) and (11) wrong? If not, then the question
> is this:

(11) in itself is right. It's the exact form used in code.
 
> bdi->dirty_ratelimit is supposed to keep track of
> write bandwidth variations only.

Yes, in a stable workload. Besides, if the number of dd tasks (N)
changes, dirty_ratelimit will adapt to the new value (write_bw / N).

> And in turn the task ratelimit will be
> driven by both bandwidth variation and pos_ratio variation.

That's right.
 
> But you seem to be doing the following.
> 
>  bdi->dirty_ratelimit = adjust based on a combination of bandwidth feedback
> 		        and pos_ratio feedback.
> 
>  task_ratelimit = bdi->dirty_ratelimit * pos_ratio		(12)
> 
> So my question is: when task_ratelimit is finally being adjusted
> based on pos_ratio feedback, why does bdi->dirty_ratelimit also need to
> take that into account?

In _concept_, bdi->dirty_ratelimit only depends on
whatever_ratelimit_executed_in_balance_dirty_pages.

Then, we try to estimate the latter with the formula

whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio

That is the main reason we want to limit the step size of bdi->dirty_ratelimit:
otherwise the above estimation will have big errors if bdi->dirty_ratelimit
has changed a lot during the past 200ms.

That's also the reason balanced_rate will have larger errors when
close to @limit: because there pos_ratio drops _quickly_ to 0, hence
the regular fluctuations in dirty pages will result in big
fluctuations in the _relative_ value of pos_ratio.
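
To illustrate the last point, here is a rough sketch. The curve below is
only an assumed stand-in for the control line: a 3rd order polynomial that
equals 1.0 at a setpoint and reaches 0.0 at the limit. The names
pos_ratio_sketch() and relative_change() and the numbers are hypothetical;
this is not the patch's bdi_position_ratio().

```c
/*
 * Hypothetical control line, for illustration only: a 3rd order polynomial
 * that is 1.0 at the setpoint and falls to 0.0 at the limit.
 */
static double pos_ratio_sketch(double dirty, double setpoint, double limit)
{
	double x = (setpoint - dirty) / (limit - setpoint);

	return 1.0 + x * x * x;
}

/* Relative change of pos_ratio when the dirty page count moves from a to b. */
static double relative_change(double a, double b, double setpoint, double limit)
{
	double pa = pos_ratio_sketch(a, setpoint, limit);
	double pb = pos_ratio_sketch(b, setpoint, limit);

	return (pa - pb) / pa;
}
```

With setpoint = 1000 and limit = 2000 pages, a 100-page increase from 1000
changes this pos_ratio by about 0.1%, while the same increase from 1800
changes it by about 44%.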

> I know you have tried explaining it, but sorry, I did not get it. Maybe
> give it another shot in layman's terms and I might understand it.

Sorry for that. I can explain if you have more questions :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-21  3:46           ` Wu Fengguang
@ 2011-08-22 17:22             ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-22 17:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > Hi Vivek,
> > > 
> > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > +					       background_thresh, nr_dirty,
> > > > > +					       bdi_thresh, bdi_dirty);
> > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > +			pause = MAX_PAUSE;
> > > > > +			goto pause;
> > > > >  		}
> > > > > +		task_ratelimit = (u64)base_rate *
> > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > 
> > > > > Hi Fengguang,
> > > > > 
> > > > > I am a little confused here. I see that you have already taken pos_ratio
> > > > > into account in bdi_update_dirty_ratelimit() and am wondering why you take
> > > > > that into account again in balance_dirty_pages().
> > > > 
> > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > 
> > > Good question. There are some inter-dependencies in the calculation,
> > > and the dependency chain is the opposite to the one in your mind:
> > > balance_dirty_pages() used pos_ratio in the first place, so
> > > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > > of the balanced dirty rate, too.
> > > 
> > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > special attention to the last paragraphs below the "......" line.
> > > 
> > > Start by throttling each dd task at rate
> > > 
> > >         task_ratelimit = task_ratelimit_0                               (1)
> > >                          (any non-zero initial value is OK)
> > > 
> > > After 200ms, we measured
> > > 
> > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > >         write_bw   = # of pages written to the disk / 200ms
> > > 
> > > For the aggressive dd dirtiers, the equality holds
> > > 
> > >         dirty_rate == N * task_rate
> > >                    == N * task_ratelimit
> > >                    == N * task_ratelimit_0                              (2)
> > > Or     
> > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > 
> > > Now we conclude that the balanced task ratelimit can be estimated by
> > > 
> > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > 
> > > Because with (2) and (3), (4) yields the desired equality (1):
> > > 
> > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > >                       == write_bw / N
> > 
> > Hi Fengguang,
> > 
> > Following is my understanding. Please correct me where I got it wrong.
> > 
> > OK, I think I follow up to this point. I think what you are saying is
> > that the following is our goal in a stable system.
> > 
> > 	task_ratelimit = write_bw/N				(6)
> > 
> > So we measure the write_bw of a bdi over a period of time and use that
> > as a feedback loop to modify bdi->dirty_ratelimit, which in turn modifies
> > task_ratelimit, and hence we achieve the balance. So we will start with
> > some arbitrary task limit, say task_ratelimit_0, and modify that limit
> > over a period of time based on our feedback loop to achieve a balanced
> > system. The following seems to be the formula.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > 					    dirty_rate
> > 
> > Now I also understand that by using (2) and (3), you proved that
> > (7) leads to (6), which is our desired goal.
> 
> That's right.
> 
> > > 
> > > .............................................................................
> > > 
> > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > the ratelimit
> > > 
> > >         task_ratelimit = task_ratelimit_0
> > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > 
> > 
> > So balance_dirty_pages() chose to take pos_ratio into account as well,
> > because for various reasons bandwidth variation alone was not sufficient
> > as feedback. So we also took pos_ratio into account, which in turn
> > depends on global dirty pages and per-bdi dirty pages/rate.
> 
> That's right so far. balance_dirty_pages() needs to do dirty position
> control, so it uses formula (5).
> 
> > So we refined the formula for calculating a task's effective rate
> > over a period of time to the following.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > 					    dirty_rate
> > 
> 
> That's not true. It should still be formula (7) when
> balance_dirty_pages() considers pos_ratio.

Why is it not true? If I do some math, it sounds right. Let me summarize
my understanding again.

- In a steady-state, stable system, we want dirty_bw = write_bw, IOW:
 
  dirty_bw/write_bw = 1  		(1)

  If we can achieve the above, then that means we are throttling tasks at
  just the right rate.

Or
-  dirty_bw  == write_bw
   N * task_ratelimit == write_bw
   task_ratelimit =  write_bw/N         (2)

  So as long as we can come up with a system where balance_dirty_pages()
  calculates task_ratelimit to be write_bw/N, we should be fine.

- But this does not take care of imbalances. So if the system goes out of
  balance before the feedback loop kicks in and the dirty rate shoots up, then
  the cache size will grow and the number of dirty pages will shoot up. Hence
  we brought in the notion of position ratio, where we also vary a
  task's dirty ratelimit based on the number of dirty pages. So our
  effective formula became:

  task_ratelimit = write_bw/N * pos_ratio     (3)

  So as long as we meet (3), we should reach a stable state.

-  But here N is unknown in advance, so balance_dirty_pages() cannot make
   use of this formula directly. But write_bw and dirty_bw from the previous
   200ms are known. So the following can replace (3).

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
					dirty_bw	

   dirty_bw = task_ratelimit_0 * N                (5)

   Substitute (5) in (4)

   task_ratelimit = write_bw/N * pos_ratio      (6)

   (6) is the same as (3) and has been derived from (4), which means that at any
   given point of time (4) can be used by balance_dirty_pages() to calculate
   a task's throttling rate.

- Now going back to (4). Because we have a feedback loop where we
  continuously update a previous number based on feedback, we can track
  the previous value in bdi->dirty_ratelimit.

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
					dirty_bw	

   Or

   task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)

   where
					    write_bw	
  bdi->dirty_ratelimit = task_ratelimit_0 * ---------
					    dirty_bw
  
  Because task_ratelimit_0 is the initial value to begin with and we will
  keep coming up with a new value every 200ms, we should be able to write
  the above as follows.

						      write_bw
  bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
						      dirty_bw

  Effectively we start with an initial value of task_ratelimit_0 and
  then keep on updating it based on rate change feedback every 200ms.

  To summarize,

  We need to achieve (3) for a balanced system. Because we don't know the
  value of N in advance, we can use (4) to achieve effect of (3). So we
  start with a default value of task_ratelimit_0 and update that every
  200ms based on how write and dirty rate on device is changing (8). We also
  further refine that rate by pos_ratio so that any variations in number
  of dirty pages due to temporary imbalances in the system can be
  accounted for (7).
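
As a rough illustration of the feedback in (8), here is a toy model in C. It
is a sketch only, not kernel code: simulate_ratelimit() and all of its
parameters are hypothetical names, pos_ratio is assumed to stay at 1.0, and
every task is assumed to dirty pages at exactly its ratelimit, as in (5).

```c
/*
 * Toy model of equation (8); illustrative only, not kernel code.
 * Assumptions: pos_ratio == 1.0 and each of the n_tasks dirtiers runs at
 * exactly the current ratelimit, so dirty_bw = n_tasks * ratelimit.
 */
static double simulate_ratelimit(double ratelimit_0, double write_bw,
				 int n_tasks, int periods)
{
	double ratelimit = ratelimit_0;	/* arbitrary non-zero start value */
	int i;

	for (i = 0; i < periods; i++) {
		/* Measured over the last period; N itself is never used below. */
		double dirty_bw = n_tasks * ratelimit;

		/* Equation (8): scale the previous value by write_bw/dirty_bw. */
		ratelimit = ratelimit * write_bw / dirty_bw;
	}
	return ratelimit;
}
```

Under these assumptions, simulate_ratelimit(7.0, 100.0, 4, 1) already yields
write_bw/N = 25 after one period, and further periods leave the value
unchanged, because dirty_bw then equals write_bw.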

I see that you also use (7). I think the only point of contention is how
(8) is perceived. So can you please explain why you think that the
above calculation or (9) is wrong?

I can kind of understand that you have done various adjustments to keep the
task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-22 17:22             ` Vivek Goyal
@ 2011-08-23  1:07               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-23  1:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 01:22:30AM +0800, Vivek Goyal wrote:
> On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> > On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > > Hi Vivek,
> > > > 
> > > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > > +					       background_thresh, nr_dirty,
> > > > > > +					       bdi_thresh, bdi_dirty);
> > > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > > +			pause = MAX_PAUSE;
> > > > > > +			goto pause;
> > > > > >  		}
> > > > > > +		task_ratelimit = (u64)base_rate *
> > > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > > 
> > > > > Hi Fengguang,
> > > > > 
> > > > > I am a little confused here. I see that you have already taken pos_ratio
> > > > > into account in bdi_update_dirty_ratelimit() and am wondering why it is
> > > > > taken into account again in balance_dirty_pages().
> > > > > 
> > > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > > 
> > > > Good question. There are some inter-dependencies in the calculation,
> > > > and the dependency chain is the opposite of the one you have in mind:
> > > > balance_dirty_pages() used pos_ratio in the first place, so
> > > > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > > > of the balanced dirty rate, too.
> > > > 
> > > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > > special attention to the last paragraphs below the "......" line.
> > > > 
> > > > Start by throttling each dd task at rate
> > > > 
> > > >         task_ratelimit = task_ratelimit_0                               (1)
> > > >                          (any non-zero initial value is OK)
> > > > 
> > > > After 200ms, we measured
> > > > 
> > > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > > >         write_bw   = # of pages written to the disk / 200ms
> > > > 
> > > > For the aggressive dd dirtiers, the equality holds
> > > > 
> > > >         dirty_rate == N * task_rate
> > > >                    == N * task_ratelimit
> > > >                    == N * task_ratelimit_0                              (2)
> > > > Or     
> > > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > > 
> > > > Now we conclude that the balanced task ratelimit can be estimated by
> > > > 
> > > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > > 
> > > > Because with (2) and (3), (4) yields the desired equality (1):
> > > > 
> > > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > > >                       == write_bw / N
> > > 
> > > Hi Fengguang,
> > > 
> > > Following is my understanding. Please correct me where I got it wrong.
> > > 
> > > Ok, I think I follow till this point. I think what you are saying is
> > > that the following is our goal in a stable system.
> > > 
> > > 	task_ratelimit = write_bw/N				(6)
> > > 
> > > So we measure the write_bw of a bdi over a period of time and use that
> > > as feedback to modify bdi->dirty_ratelimit, which in turn modifies
> > > task_ratelimit, and hence we achieve the balance. So we will start with
> > > some arbitrary task limit, say task_ratelimit_0, and modify that limit
> > > over a period of time based on our feedback loop to achieve a balanced
> > > system. And the following seems to be the formula.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > > 					    dirty_rate
> > > 
> > > Now I also understand that by using (2) and (3), you proved
> > > how (7) will lead to (6), and that is our desired goal. 
> > 
> > That's right.
> > 
> > > > 
> > > > .............................................................................
> > > > 
> > > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > > the ratelimit
> > > > 
> > > >         task_ratelimit = task_ratelimit_0
> > > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > > 
> > > 
> > > So balance_dirty_pages() chose to take pos_ratio into account as well,
> > > because, for various reasons, taking only bandwidth variation into
> > > account as feedback was not sufficient. So we also took pos_ratio
> > > into account, which in turn depends on global dirty pages and per-bdi
> > > dirty_pages/rate.
> > 
> > That's right so far. balance_dirty_pages() needs to do dirty position
> > control, so it used formula (5).
> > 
> > > So we refined the formula for calculating a task's effective rate
> > > over a period of time to the following.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > 					    dirty_rate
> > > 
> > 
> > That's not true. It should still be formula (7) when
> > balance_dirty_pages() considers pos_ratio.
> 
> Why is it not true? If I do some math, it sounds right. Let me summarize
> my understanding again.

Ah sorry! (9) actually holds true, as made clear by your reasoning below.

> - In a steady state stable system, we want dirty_bw = write_bw, IOW.
>  
>   dirty_bw/write_bw = 1  		(1)
> 
>   If we can achieve the above, that means we are throttling tasks at
>   just the right rate.
> 
> Or
> -  dirty_bw  == write_bw
>    N * task_ratelimit == write_bw
>    task_ratelimit =  write_bw/N         (2)
> 
>   So as long as we can come up with a system where balance_dirty_pages()
>   calculates task_ratelimit to be write_bw/N, we should be fine.

Right.

> - But this does not take care of imbalances. So if the system goes out of
>   balance before the feedback loop kicks in and the dirty rate shoots up, then
>   the cache size will grow and the number of dirty pages will shoot up. Hence
>   we brought in the notion of position ratio, where we also vary a
>   task's dirty ratelimit based on the number of dirty pages. So our
>   effective formula became
> 
>   task_ratelimit = write_bw/N * pos_ratio     (3)
> 
>   So as long as we meet (3), we should reach a stable state.

Right.

> -  But here N is unknown in advance, so balance_dirty_pages() cannot make
>    use of this formula directly. But write_bw and dirty_bw from the previous
>    200ms are known. So the following can replace (3).
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> 					dirty_bw	
> 
>    dirty_bw = task_ratelimit_0 * N                (5)
> 
>    Substitute (5) in (4)
> 
>    task_ratelimit = write_bw/N * pos_ratio      (6)
> 
>    (6), which has been derived from (4), is the same as (3), and that means at any
>    given point of time (4) can be used by balance_dirty_pages() to calculate
>    a task's throttling rate.

Right. Sorry, what I had in mind was

                                       write_bw
    balanced_rate = task_ratelimit_0 * --------
                                       dirty_bw        

    task_ratelimit = balanced_rate * pos_ratio

which is effectively the same as your combined equation (4).
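
For reference, the equivalence can be spelled out by substituting your (5),
dirty_bw = task_ratelimit_0 * N, into the two-step form. This is only a
restatement of the equations already in this thread, written as a LaTeX
sketch with the same symbols:

```latex
\begin{align*}
\text{task\_ratelimit}
  &= \text{balanced\_rate} \cdot \text{pos\_ratio} \\
  &= \text{task\_ratelimit}_0 \cdot \frac{\text{write\_bw}}{\text{dirty\_bw}}
     \cdot \text{pos\_ratio} \\
  &= \text{task\_ratelimit}_0 \cdot
     \frac{\text{write\_bw}}{N \cdot \text{task\_ratelimit}_0}
     \cdot \text{pos\_ratio} \\
  &= \frac{\text{write\_bw}}{N} \cdot \text{pos\_ratio}
\end{align*}
```

The last line is your (3) and (6), so both forms describe the same rate.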

> - Now going back to (4). Because we have a feedback loop where we
>   continuously update a previous number based on feedback, we can track
>   the previous value in bdi->dirty_ratelimit.
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> 					dirty_bw	
> 
>    Or
> 
>    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> 
>    where
> 					    write_bw	
>   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> 					    dirty_bw

Right.

>   Because task_ratelimit_0 is just the initial value and we will
>   keep coming up with a new value every 200ms, we should be able to write
>   the above as follows.
> 
> 						      write_bw
>   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> 						      dirty_bw
> 
>   Effectively we start with an initial value of task_ratelimit_0 and
>   then keep on updating it based on rate change feedback every 200ms.

Right.

>   To summarize,
> 
>   We need to achieve (3) for a balanced system. Because we don't know the
>   value of N in advance, we can use (4) to achieve the effect of (3). So we
>   start with a default value of task_ratelimit_0 and update that every
>   200ms based on how the write and dirty rates on the device are changing (8).
>   We also further refine that rate by pos_ratio so that any variations in
>   the number of dirty pages due to temporary imbalances in the system can be
>   accounted for (7).
> 
> I see that you also use (7). I think the only point of contention is how
> (8) is perceived. So can you please explain why you think that the
> above calculation or (9) is wrong?

There is no point of contention and (9) is right. Sorry, it's my fault.
We are well aligned in the above reasoning :)

> I can kind of understand that you have done various adjustments to keep the
> task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> I am not able to understand your calculations in updating bdi->dirty_ratelimit.  

You mean the chunk of code below? It is effectively the same as this _one_
line of code

        bdi->dirty_ratelimit = balanced_rate;

except for doing some tricks (conditional update and limiting step size) to
stabilize bdi->dirty_ratelimit:

        unsigned long base_rate = bdi->dirty_ratelimit;

        /*
         * Use a different name for the same value to distinguish the concepts.
         * Only the relative value of
         *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
         * will be used below, which reflects the direction and size of dirty
         * position error.
         */
        pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
         * same side of dirty_ratelimit, too.
         * For example,
         * - (base_rate > balanced_rate) => dirty rate is too high
         * - (base_rate > pos_rate)      => dirty pages are above setpoint
         * so lowering base_rate will help meet both the position and rate
         * control targets. Otherwise, don't update base_rate if it will only
         * help meet the rate target. After all, what users ultimately feel
         * and care about are a stable dirty rate and a small position error.  This
         * update policy can also prevent dirty_ratelimit from being driven
         * away by possible systematic errors in balanced_rate.
         *
         * |base_rate - pos_rate| is also used to limit the step size for
         * filtering out the singular points of balanced_rate, which keeps
         * jumping around randomly and can even leap far away at times due to
         * the small 200ms estimation period of dirty_rate (we want to keep
         * that period small to reduce time lags).
         */
        delta = 0;
        if (base_rate < balanced_rate) {
                if (base_rate < pos_rate)
                        delta = min(balanced_rate, pos_rate) - base_rate;
        } else {
                if (base_rate > pos_rate)
                        delta = base_rate - max(balanced_rate, pos_rate);
        }
       
        /*
         * Don't pursue 100% rate matching. It's impossible since the balanced
         * rate itself is constantly fluctuating. So decrease the track speed
         * when it gets close to the target. Helps eliminate pointless tremors.
         */
        delta >>= base_rate / (8 * delta + 1);
        /*
         * Limit the tracking speed to avoid overshooting.
         */
        delta = (delta + 7) / 8;

        if (base_rate < balanced_rate)
                base_rate += delta;
        else   
                base_rate -= delta;

        bdi->dirty_ratelimit = max(base_rate, 1UL);
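
To make the step-size arithmetic easier to follow, the chunk above can be
wrapped into a standalone function for experimenting in userspace. This is a
sketch under stated assumptions, not the patch itself: the function name
update_dirty_ratelimit(), the helpers min_ul()/max_ul() and the value 10 for
RATELIMIT_CALC_SHIFT are assumptions of the example, and the check on the
shift count is an addition of the sketch, since shifting by the full width of
the type or more is undefined in C (it can happen when delta is 0 and
base_rate is large).

```c
#include <limits.h>
#include <stdint.h>

/* Assumed fixed-point scale for pos_ratio: 1.0 is represented as 1 << 10. */
#define RATELIMIT_CALC_SHIFT 10

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Hypothetical wrapper: returns the new value for bdi->dirty_ratelimit. */
static unsigned long update_dirty_ratelimit(unsigned long base_rate,
					    unsigned long balanced_rate,
					    unsigned long pos_ratio)
{
	unsigned long pos_rate = (uint64_t)base_rate * pos_ratio >>
						RATELIMIT_CALC_SHIFT;
	unsigned long delta = 0;
	unsigned long shift;

	/* Only move when balanced_rate and pos_rate agree on the direction. */
	if (base_rate < balanced_rate) {
		if (base_rate < pos_rate)
			delta = min_ul(balanced_rate, pos_rate) - base_rate;
	} else {
		if (base_rate > pos_rate)
			delta = base_rate - max_ul(balanced_rate, pos_rate);
	}

	/* Slow down for small errors; guard the shift count (sketch only). */
	shift = base_rate / (8 * delta + 1);
	if (shift < sizeof(delta) * CHAR_BIT)
		delta >>= shift;
	else
		delta = 0;

	/* Take at most 1/8 of the remaining distance, rounded up. */
	delta = (delta + 7) / 8;

	if (base_rate < balanced_rate)
		base_rate += delta;
	else
		base_rate -= delta;

	return max_ul(base_rate, 1UL);
}
```

With these assumptions, base_rate = 1000, balanced_rate = 2000 and
pos_ratio = 1536 (1.5) give pos_rate = 1500, delta = 500 and a new value of
1063. With pos_ratio = 512 (0.5) the two signals disagree, delta stays 0 and
the value remains 1000.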

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-23  1:07               ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-23  1:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 01:22:30AM +0800, Vivek Goyal wrote:
> On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> > On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > > Hi Vivek,
> > > > 
> > > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > > +					       background_thresh, nr_dirty,
> > > > > > +					       bdi_thresh, bdi_dirty);
> > > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > > +			pause = MAX_PAUSE;
> > > > > > +			goto pause;
> > > > > >  		}
> > > > > > +		task_ratelimit = (u64)base_rate *
> > > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > > 
> > > > > Hi Fenguaang,
> > > > > 
> > > > > I am little confused here. I see that you have already taken pos_ratio
> > > > > into account in bdi_update_dirty_ratelimit() and wondering why to take
> > > > > that into account again in balance_diry_pages().
> > > > > 
> > > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > > 
> > > > Good question. There are some inter-dependencies in the calculation,
> > > > and the dependency chain is the opposite to the one in your mind:
> > > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > > bdi_update_dirty_ratelimit() have to use pos_ratio in the calculation
> > > > of the balanced dirty rate, too.
> > > > 
> > > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > > special attention to the last paragraphs below the "......" line.
> > > > 
> > > > Start by throttling each dd task at rate
> > > > 
> > > >         task_ratelimit = task_ratelimit_0                               (1)
> > > >                          (any non-zero initial value is OK)
> > > > 
> > > > After 200ms, we measured
> > > > 
> > > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > > >         write_bw   = # of pages written to the disk / 200ms
> > > > 
> > > > For the aggressive dd dirtiers, the equality holds
> > > > 
> > > >         dirty_rate == N * task_rate
> > > >                    == N * task_ratelimit
> > > >                    == N * task_ratelimit_0                              (2)
> > > > Or     
> > > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > > 
> > > > Now we conclude that the balanced task ratelimit can be estimated by
> > > > 
> > > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > > 
> > > > Because with (2) and (3), (4) yields the desired equality (1):
> > > > 
> > > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > > >                       == write_bw / N
> > > 
> > > Hi Fengguang,
> > > 
> > > Following is my understanding. Please correct me where I got it wrong.
> > > 
> > > Ok, I think I follow till this point. I think what you are saying is
> > > that following is our goal in a stable system.
> > > 
> > > 	task_ratelimit = write_bw/N				(6)
> > > 
> > > So we measure the write_bw of a bdi over a period of time and use that
> > > as feedback loop to modify bdi->dirty_ratelimit which inturn modifies
> > > task_ratelimit and hence we achieve the balance. So we will start with
> > > some arbitrary task limit say task_ratelimit_0, and modify that limit
> > > over a period of time based on our feedback loop to achieve a balanced
> > > system. And following seems to be the formula.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > > 					    dirty_rate
> > > 
> > > Now I also understand that by using (2) and (3), you proved that
> > > how (7) will lead to (6) and that is our deisred goal. 
> > 
> > That's right.
> > 
> > > > 
> > > > .............................................................................
> > > > 
> > > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > > the ratelimit
> > > > 
> > > >         task_ratelimit = task_ratelimit_0
> > > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > > 
> > > 
> > > So balance_drity_pages() chose to take into account pos_ratio() also
> > > because for various reason like just taking into account only bandwidth
> > > variation as feedback was not sufficient. So we also took pos_ratio
> > > into account which in-trun is dependent on gloabal dirty pages and per
> > > bdi dirty_pages/rate.
> > 
> > That's right so far. balance_drity_pages() needs to do dirty position
> > control, so used formula (5).
> > 
> > > So we refined the formula for calculating a tasks's effective rate
> > > over a period of time to following.
> > > 					    write_bw
> > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > 					    dirty_rate
> > > 
> > 
> > That's not true. It should still be formula (7) when
> > balance_drity_pages() considers pos_ratio.
> 
> Why it is not true? If I do some math, it sounds right. Let me summarize
> my understanding again.

Ah sorry! (9) actually holds true, as made clear by your below reasoning.

> - In a steady state stable system, we want dirty_bw = write_bw, IOW.
>  
>   dirty_bw/write_bw = 1  		(1)
> 
>   If we can achieve above then that means we are throttling tasks at
>   just right rate.
> 
> Or
> -  dirty_bw  == write_bw
>    N * task_ratelimit == write_bw
>    task_ratelimit =  write_bw/N         (2)
> 
>   So as long as we can come up with a system where balance_dirty_pages()
>   calculates task_ratelimit to be write_bw/N, we should be fine.

Right.

> - But this does not take care of imbalances. So if system goes out of
>   balance before feedback loop kicks in and dirty rate shoots up, then
>   cache size will grow and number of dirty pages will shoot up. Hence
>   we brought in the notion of position ratio where we also vary a 
>   tasks's dirty ratelimit based on number of dirty pages. So our
>   effective formula became.
> 
>   task_ratelimit = write_bw/N * pos_ratio     (3)
> 
>   So as long as we meet (3), we should reach to stable state.

Right.

> -  But here N is unknown in advance so balance_drity_pages() can not make
>    use of this formula directly. But write_bw and dirty_bw from previous
>    200ms are known. So following can replace (3).
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> 					dirty_bw	
> 
>    dirty_bw = task_ratelimit_0 * N                (5)
> 
>    Substitute (5) in (4)
> 
>    task_ratelimit = write_bw/N * pos_ratio      (6)
> 
>    (6) is same as (3) which has been derived from (4) and that means at any
>    given point of time (4) can be used by balance_drity_pages() to calculate
>    a tasks's throttling rate.

Right. Sorry what's in my mind was

                                       write_bw
    balanced_rate = task_ratelimit_0 * --------
                                       dirty_bw        

    task_ratelimit = balanced_rate * pos_ratio

which is effective the same to your combined equation (4).

> - Now going back to (4). Because we have a feedback loop where we
>   continuously update a previous number based on feedback, we can track
>   previous value in bdi->dirty_ratelimit.
> 
> 				       write_bw
>    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> 					dirty_bw	
> 
>    Or
> 
>    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> 
>    where
> 					    write_bw	
>   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> 					    dirty_bw

Right.

>   Because task_ratelimit_0 is initial value to begin with and we will
>   keep on coming with new value every 200ms, we should be able to write
>   above as follows.
> 
> 						      write_bw
>   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> 						      dirty_bw
> 
>   Effectively we start with an initial value of task_ratelimit_0 and
>   then keep on updating it based on rate change feedback every 200ms.

Right.

>   To summarize,
> 
>   We need to achieve (3) for a balanced system. Because we don't know the
>   value of N in advance, we can use (4) to achieve effect of (3). So we
>   start with a default value of task_ratelimit_0 and update that every
>   200ms based on how write and dirty rate on device is changing (8). We also
>   further refine that rate by pos_ratio so that any variations in number
>   of dirty pages due to temporary imbalances in the system can be
>   accounted for (7).
> 
> I see that you also use (7). I think only contention point is how
> (8) is perceived. So can you please explain why do you think that
> above calculation or (9) is wrong.

There is no contention point and (9) is right..Sorry it's my fault.
We are well aligned in the above reasoning :)

> I can kind of understand that you have done various adjustments to keep the
> task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> I am not able to understand your calculations in updating bdi->dirty_ratelimit.  

You mean the below chunk of code? Which is effectively the same as this _one_
line of code

        bdi->dirty_ratelimit = balanced_rate;

except for doing some tricks (conditional update and limiting step size) to
stabilize bdi->dirty_ratelimit:

        unsigned long base_rate = bdi->dirty_ratelimit;

        /*
         * Use a different name for the same value to distinguish the concepts.
         * Only the relative value of
         *     (pos_rate - base_rate) = (pos_ratio - 1) * base_rate
         * will be used below, which reflects the direction and size of dirty
         * position error.
         */
        pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * dirty_ratelimit will follow balanced_rate iff pos_rate is on the
         * same side of dirty_ratelimit, too.
         * For example,
         * - (base_rate > balanced_rate) => dirty rate is too high
         * - (base_rate > pos_rate)      => dirty pages are above setpoint
         * so lowering base_rate will help meet both the position and rate
         * control targets. Otherwise, don't update base_rate if it will only
         * help meet the rate target. After all, what users ultimately notice
         * and care about is a stable dirty rate and a small position error.  This
         * update policy can also prevent dirty_ratelimit from being driven
         * away by possible systematic errors in balanced_rate.
         *
         * |base_rate - pos_rate| is also used to limit the step size for
         * filtering out the singular points of balanced_rate, which keeps
         * jumping around randomly and can even leap far away at times due to
         * the small 200ms estimation period of dirty_rate (we want to keep
         * that period small to reduce time lags).
         */
        delta = 0;
        if (base_rate < balanced_rate) {
                if (base_rate < pos_rate)
                        delta = min(balanced_rate, pos_rate) - base_rate;
        } else {
                if (base_rate > pos_rate)
                        delta = base_rate - max(balanced_rate, pos_rate);
        }
       
        /*
         * Don't pursue 100% rate matching. It's impossible since the balanced
         * rate itself is constantly fluctuating. So decrease the track speed
         * when it gets close to the target. Helps eliminate pointless tremors.
         */
        delta >>= base_rate / (8 * delta + 1);
        /*
         * Limit the tracking speed to avoid overshooting.
         */
        delta = (delta + 7) / 8;

        if (base_rate < balanced_rate)
                base_rate += delta;
        else   
                base_rate -= delta;

        bdi->dirty_ratelimit = max(base_rate, 1UL);
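
For anyone who wants to experiment with the update rule, here is a minimal
userspace sketch of the chunk above. It is an illustration only, not the
kernel code. It assumes RATELIMIT_CALC_SHIFT is 10 (so pos_ratio == 1024
means 1.0) and wraps the logic in a hypothetical helper,
update_dirty_ratelimit(). The check on the shift count is an addition of
this sketch: when delta is small, base_rate / (8 * delta + 1) can exceed the
width of unsigned long, and shifting by that much is undefined in C.

```c
#include <assert.h>
#include <limits.h>

/* Assumption of this sketch: fixed-point scale where 1.0 == 1 << 10. */
#define RATELIMIT_CALC_SHIFT 10
#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/*
 * Hypothetical helper: one update step of the per-bdi rate limit.
 * base_rate and balanced_rate are in pages/s, pos_ratio is fixed-point.
 */
static unsigned long update_dirty_ratelimit(unsigned long base_rate,
                                            unsigned long balanced_rate,
                                            unsigned long pos_ratio)
{
        unsigned long pos_rate, shift, delta = 0;

        pos_rate = (unsigned long)(((unsigned long long)base_rate * pos_ratio)
                                   >> RATELIMIT_CALC_SHIFT);

        /* Only move when rate and position feedback point the same way. */
        if (base_rate < balanced_rate) {
                if (base_rate < pos_rate)
                        delta = min_ul(balanced_rate, pos_rate) - base_rate;
        } else {
                if (base_rate > pos_rate)
                        delta = base_rate - max_ul(balanced_rate, pos_rate);
        }

        /* Slow down near the target; guard the shift count (added here). */
        shift = base_rate / (8 * delta + 1);
        delta = shift < BITS_PER_LONG ? delta >> shift : 0;

        /* Limit the tracking speed to avoid overshooting. */
        delta = (delta + 7) / 8;

        if (base_rate < balanced_rate)
                base_rate += delta;
        else
                base_rate -= delta;

        return max_ul(base_rate, 1UL);
}
```

With base_rate = 1000 and balanced_rate = 2000, a pos_ratio of 2.0 (2048)
moves the limit to 1125, while a pos_ratio of 0.5 (512) leaves it at 1000,
because the position and rate feedback disagree.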

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23  1:07               ` Wu Fengguang
@ 2011-08-23  3:53                 ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.

Ah sorry, based on the reply to Peter, there is no inherent dependency
between balanced_rate_n and balanced_rate_(n-1). bdi->dirty_ratelimit does
track balanced_rate in small steps, and hence will have some relationship
with its previous value other than equation (8).

So, although you can derive equation (8) for balanced_rate, it is
better not to think of it that way. Keep this fundamental
formula in mind and don't try to complicate it:

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
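
A small sketch may help show why no history is needed. It assumes a
simplified model in which N identical tasks each dirty pages at exactly
the rate they were throttled to over the last 200ms; the helper names
(estimate_balanced_rate, model_dirty_rate) are made up for this example,
and the zero-divisor guard is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
 * All rates are in pages per second.
 */
static uint64_t estimate_balanced_rate(uint64_t task_ratelimit_200ms,
                                       uint64_t write_bw,
                                       uint64_t dirty_rate)
{
        /* Guard against an idle period with no dirtying (sketch only). */
        if (dirty_rate == 0)
                dirty_rate = 1;
        return task_ratelimit_200ms * write_bw / dirty_rate;
}

/* Hypothetical model: nr_tasks tasks, each dirtying at task_ratelimit. */
static uint64_t model_dirty_rate(unsigned int nr_tasks, uint64_t task_ratelimit)
{
        return (uint64_t)nr_tasks * task_ratelimit;
}
```

Under this model, with write_bw = 25600 and 4 tasks, a throttle rate of
1000 and a throttle rate of 3200 both give a balanced_rate of 6400, which
is write_bw / N. The estimate depends only on the last 200ms sample, not
on any previous balanced_rate.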

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23  1:07               ` Wu Fengguang
@ 2011-08-23 13:53                 ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-23 13:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:

[..]
> > > > So we refined the formula for calculating a tasks's effective rate
> > > > over a period of time to following.
> > > > 					    write_bw
> > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > 					    dirty_rate
> > > > 
> > > 
> > > That's not true. It should still be formula (7) when
> > > balance_dirty_pages() considers pos_ratio.
> > 
> > Why it is not true? If I do some math, it sounds right. Let me summarize
> > my understanding again.
> 
> Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> 
> > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> >  
> >   dirty_bw/write_bw = 1  		(1)
> > 
> >   If we can achieve above then that means we are throttling tasks at
> >   just right rate.
> > 
> > Or
> > -  dirty_bw  == write_bw
> >    N * task_ratelimit == write_bw
> >    task_ratelimit =  write_bw/N         (2)
> > 
> >   So as long as we can come up with a system where balance_dirty_pages()
> >   calculates task_ratelimit to be write_bw/N, we should be fine.
> 
> Right.
> 
> > - But this does not take care of imbalances. So if system goes out of
> >   balance before feedback loop kicks in and dirty rate shoots up, then
> >   cache size will grow and number of dirty pages will shoot up. Hence
> >   we brought in the notion of position ratio where we also vary a 
> >   tasks's dirty ratelimit based on number of dirty pages. So our
> >   effective formula became.
> > 
> >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > 
> >   So as long as we meet (3), we should reach to stable state.
> 
> Right.
> 
> > -  But here N is unknown in advance so balance_dirty_pages() cannot make
> >    use of this formula directly. But write_bw and dirty_bw from previous
> >    200ms are known. So following can replace (3).
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > 					dirty_bw	
> > 
> >    dirty_bw = task_ratelimit_0 * N                (5)
> > 
> >    Substitute (5) in (4)
> > 
> >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > 
> >    (6) is same as (3) which has been derived from (4) and that means at any
> >    given point of time (4) can be used by balance_dirty_pages() to calculate
> >    a task's throttling rate.
> 
> Right. Sorry what's in my mind was
> 
>                                        write_bw
>     balanced_rate = task_ratelimit_0 * --------
>                                        dirty_bw        
> 
>     task_ratelimit = balanced_rate * pos_ratio
> 
> which is effectively the same as your combined equation (4).
> 
> > - Now going back to (4). Because we have a feedback loop where we
> >   continuously update a previous number based on feedback, we can track
> >   previous value in bdi->dirty_ratelimit.
> > 
> > 				       write_bw
> >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > 					dirty_bw	
> > 
> >    Or
> > 
> >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > 
> >    where
> > 					    write_bw	
> >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > 					    dirty_bw
> 
> Right.
> 
> >   Because task_ratelimit_0 is initial value to begin with and we will
> >   keep on coming with new value every 200ms, we should be able to write
> >   above as follows.
> > 
> > 						      write_bw
> >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > 						      dirty_bw
> > 
> >   Effectively we start with an initial value of task_ratelimit_0 and
> >   then keep on updating it based on rate change feedback every 200ms.
> 
> Right.
> 
> >   To summarize,
> > 
> >   We need to achieve (3) for a balanced system. Because we don't know the
> >   value of N in advance, we can use (4) to achieve effect of (3). So we
> >   start with a default value of task_ratelimit_0 and update that every
> >   200ms based on how write and dirty rate on device is changing (8). We also
> >   further refine that rate by pos_ratio so that any variations in number
> >   of dirty pages due to temporary imbalances in the system can be
> >   accounted for (7).
> > 
> > I see that you also use (7). I think only contention point is how
> > (8) is perceived. So can you please explain why do you think that
> > above calculation or (9) is wrong.
> 
> There is no point of contention, and (9) is right. Sorry, that was my fault.
> We are well aligned in the above reasoning :)

Great. We are on the same page now, at least up to this point.

> 
> > I can kind of understand that you have done various adjustments to keep the
> > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> 
> You mean the chunk of code below? It is effectively the same as this _one_
> line of code
> 
>         bdi->dirty_ratelimit = balanced_rate;
> 
> except for doing some tricks (conditional update and limiting step size) to
> stabilize bdi->dirty_ratelimit:

I am fine with bdi->dirty_ratelimit being called balanced rate. I am
taking exception to the fact that you are also taking into account
pos_ratio while coming up with new balanced_rate after 200ms of feedback.

We agreed to updating bdi->dirty_ratelimit as follows (8 above).

 
 						      write_bw
   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
 						      dirty_bw

I think in your terminology it could be written as:
					   write_bw
  new_balanced_rate = prev_balanced_rate * ----------            (9)
					   dirty_bw

But what you seem to be doing is the following:
							write_bw
  new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
							dirty_bw

Of course I have just tried to simplify your actual calculations to
show why I am questioning the presence of pos_ratio while calculating
the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.

So (9) and (10) don't match?

Now let me go back to your code and show how I arrived at (10).

executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
			dirty_rate | 1);			(12)

Combining (11) and (12) gives us (10):

                                          write_bw
  balanced_rate = base_rate * pos_ratio * ----------
                                          dirty_rate

Or

                                                 write_bw
  bdi->dirty_ratelimit = base_rate * pos_ratio * ----------
                                                 dirty_rate

To complicate things, you also have the notion of pos_rate and reduce
the step size based on either pos_rate or balanced_rate.

pos_rate = executed_rate = base_rate * pos_ratio;

                                        write_bw
balanced_rate = base_rate * pos_ratio * ----------
                                        dirty_rate

bdi->dirty_ratelimit = min_change(pos_rate, balanced_rate)       (13)

So for feedback, why not simply stick to (9) and limit the step
size, without taking pos_ratio into account?

Even if you have to take it into account, it needs to be explained clearly,
and so many rate definitions confuse things further. Keeping names consistent
everywhere (even for local variables) helps in understanding the code.

Look at the number of rates we have in the code; it gets confusing.

balanced_rate
base_rate
bdi->dirty_ratelimit

executed_rate
pos_rate
task_ratelimit

dirty_rate
write_bw

Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to
refer to the same thing, and that is not obvious from the code. Likewise,
task_ratelimit, executed_rate and pos_rate seem to refer to the same
thing.

So instead of 6 rates, we could at least collapse the naming to 2 rates
to keep the context clear. Just add a prefix or suffix to highlight the
subtle difference between two rates.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23 13:53                 ` Vivek Goyal
@ 2011-08-24  3:09                   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote:
> On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > > > So we refined the formula for calculating a tasks's effective rate
> > > > > over a period of time to following.
> > > > > 					    write_bw
> > > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > > 					    dirty_rate
> > > > > 
> > > > 
> > > > That's not true. It should still be formula (7) when
> > > > balance_dirty_pages() considers pos_ratio.
> > > 
> > > Why it is not true? If I do some math, it sounds right. Let me summarize
> > > my understanding again.
> > 
> > Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> > 
> > > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> > >  
> > >   dirty_bw/write_bw = 1  		(1)
> > > 
> > >   If we can achieve above then that means we are throttling tasks at
> > >   just right rate.
> > > 
> > > Or
> > > -  dirty_bw  == write_bw
> > >    N * task_ratelimit == write_bw
> > >    task_ratelimit =  write_bw/N         (2)
> > > 
> > >   So as long as we can come up with a system where balance_dirty_pages()
> > >   calculates task_ratelimit to be write_bw/N, we should be fine.
> > 
> > Right.
> > 
> > > - But this does not take care of imbalances. So if system goes out of
> > >   balance before feedback loop kicks in and dirty rate shoots up, then
> > >   cache size will grow and number of dirty pages will shoot up. Hence
> > >   we brought in the notion of position ratio where we also vary a 
> > >   tasks's dirty ratelimit based on number of dirty pages. So our
> > >   effective formula became.
> > > 
> > >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > > 
> > >   So as long as we meet (3), we should reach to stable state.
> > 
> > Right.
> > 
> > > -  But here N is unknown in advance so balance_dirty_pages() cannot make
> > >    use of this formula directly. But write_bw and dirty_bw from previous
> > >    200ms are known. So following can replace (3).
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > > 					dirty_bw	
> > > 
> > >    dirty_bw = task_ratelimit_0 * N                (5)
> > > 
> > >    Substitute (5) in (4)
> > > 
> > >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > > 
> > >    (6) is same as (3) which has been derived from (4) and that means at any
> > >    given point of time (4) can be used by balance_dirty_pages() to calculate
> > >    a task's throttling rate.
> > 
> > Right. Sorry what's in my mind was
> > 
> >                                        write_bw
> >     balanced_rate = task_ratelimit_0 * --------
> >                                        dirty_bw        
> > 
> >     task_ratelimit = balanced_rate * pos_ratio
> > 
> > which is effectively the same as your combined equation (4).
> > 
> > > - Now going back to (4). Because we have a feedback loop where we
> > >   continuously update a previous number based on feedback, we can track
> > >   previous value in bdi->dirty_ratelimit.
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > > 					dirty_bw	
> > > 
> > >    Or
> > > 
> > >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > > 
> > >    where
> > > 					    write_bw	
> > >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > > 					    dirty_bw
> > 
> > Right.
> > 
> > >   Because task_ratelimit_0 is initial value to begin with and we will
> > >   keep on coming with new value every 200ms, we should be able to write
> > >   above as follows.
> > > 
> > > 						      write_bw
> > >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > > 						      dirty_bw
> > > 
> > >   Effectively we start with an initial value of task_ratelimit_0 and
> > >   then keep on updating it based on rate change feedback every 200ms.
> > 
> > Right.
> > 
> > >   To summarize,
> > > 
> > >   We need to achieve (3) for a balanced system. Because we don't know the
> > >   value of N in advance, we can use (4) to achieve effect of (3). So we
> > >   start with a default value of task_ratelimit_0 and update that every
> > >   200ms based on how write and dirty rate on device is changing (8). We also
> > >   further refine that rate by pos_ratio so that any variations in number
> > >   of dirty pages due to temporary imbalances in the system can be
> > >   accounted for (7).
> > > 
> > > I see that you also use (7). I think only contention point is how
> > > (8) is perceived. So can you please explain why do you think that
> > > above calculation or (9) is wrong.
> > 
> > There is no point of contention, and (9) is right. Sorry, that was my fault.
> > We are well aligned in the above reasoning :)
> 
> Great. We are on the same page now, at least up to this point.
> 
> > 
> > > I can kind of understand that you have done various adjustments to keep the
> > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> > 
> > You mean the chunk of code below? It is effectively the same as this _one_
> > line of code
> > 
> >         bdi->dirty_ratelimit = balanced_rate;
> > 
> > except for doing some tricks (conditional update and limiting step size) to
> > stabilize bdi->dirty_ratelimit:
> 
> I am fine with bdi->dirty_ratelimit being called balanced rate. I am
> taking exception to the fact that you are also taking into account
> pos_ratio while coming up with new balanced_rate after 200ms of feedback.
> 
> We agreed to updating bdi->dirty_ratelimit as follows (8 above).
> 
>  
>  						      write_bw
>    bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
>  						      dirty_bw
> 
> I think in your terminology it could be written as:
> 					   write_bw
>   new_balanced_rate = prev_balanced_rate * ----------            (9)
> 					   dirty_bw
> 
> But what you seem to be doing is the following:
> 							write_bw
>   new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
> 							dirty_bw
> 
> Of course I have just tried to simplify your actual calculations to
> show why I am questioning the presence of pos_ratio while calculating
> the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.
> 
> So (9) and (10) don't match?
> 
> Now let me go back to your code and show how I arrived at (10).
> 
> executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
> balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
> 			dirty_rate | 1);			(12)
> 
> Combining (11) and (12) gives us (10):
> 
>                                           write_bw
>   balanced_rate = base_rate * pos_ratio * ----------
>                                           dirty_rate
> 
> Or
> 
>                                                  write_bw
>   bdi->dirty_ratelimit = base_rate * pos_ratio * ----------
>                                                  dirty_rate

I hope the other email on the balanced_rate estimation equation
clarifies the questions about pos_ratio.
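
As a rough illustration of why pos_ratio drops out of the estimate, the
sketch below replays (11) and (12) in userspace. It assumes
RATELIMIT_CALC_SHIFT is 10 and a simplified model where nr_tasks tasks
each dirty at exactly the executed rate during the 200ms period, so the
measured dirty_rate already contains pos_ratio. The function names are
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption of this sketch: fixed-point scale where 1.0 == 1 << 10. */
#define RATELIMIT_CALC_SHIFT 10

/* (11): the rate the tasks were actually throttled to. */
static uint64_t executed_rate(uint64_t base_rate, uint64_t pos_ratio)
{
        return base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;
}

/* (12), with dirty_rate taken from the simplified N-task model. */
static uint64_t balanced_rate_under_model(uint64_t base_rate,
                                          uint64_t pos_ratio,
                                          uint64_t write_bw,
                                          unsigned int nr_tasks)
{
        uint64_t rate = executed_rate(base_rate, pos_ratio);
        /* Modelled measurement: every task dirtied at the executed rate. */
        uint64_t dirty_rate = (uint64_t)nr_tasks * rate;

        /* "| 1" avoids a zero divisor, as in the quoted line (12). */
        return rate * write_bw / (dirty_rate | 1);
}
```

With base_rate = 1000, write_bw = 25600 and 4 tasks, pos_ratio values of
0.5, 1.0 and 2.0 (512, 1024, 2048) all yield a result within a few pages/s
of write_bw / N = 6400. pos_ratio scales the numerator and the measured
dirty_rate alike, so it cancels.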

> To complicate things, you also have the notion of pos_rate and reduce
> the step size based on either pos_rate or balanced_rate.
> 
> pos_rate = executed_rate = base_rate * pos_ratio;
> 
>                                         write_bw
> balanced_rate = base_rate * pos_ratio * ----------
>                                         dirty_rate
> 
> bdi->dirty_ratelimit = min_change(pos_rate, balanced_rate)       (13)
> 
> So for feedback, why not simply stick to (9) and limit the step
> size, without taking pos_ratio into account?

pos_rate is used to limit the step size. This reply to Peter has more
details:

http://www.spinics.net/lists/linux-fsdevel/msg47991.html

> Even if you have to take it into account, it needs to be explained clearly,
> and so many rate definitions confuse things further. Keeping names consistent
> everywhere (even for local variables) helps in understanding the code.
> 

Good idea! There are too many names that differ only subtly.

> Look at the number of rates we have in the code; it gets confusing.
> 
> balanced_rate
> base_rate
> bdi->dirty_ratelimit
> 
> executed_rate
> pos_rate
> task_ratelimit
> 
> dirty_rate
> write_bw
> 
> Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to
> refer to the same thing, and that is not obvious from the code. Likewise,
> task_ratelimit, executed_rate and pos_rate seem to refer to the same
> thing.

Right.

> So instead of 6 rates, we could at least collapse the naming to 2 rates
> to keep the context clear. Just add a prefix or suffix to highlight the
> subtle difference between two rates.

How about

  balanced_rate            =>  balanced_dirty_ratelimit
  base_rate                =>  dirty_ratelimit
  bdi->dirty_ratelimit     ==  bdi->dirty_ratelimit

  pos_rate                 =>  task_ratelimit
  executed_rate            =>  task_ratelimit
  task_ratelimit           ==  task_ratelimit

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-24  3:16           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
> span?

OK, I'll follow your suggestion to use

        span = 8 * write_bw, for single bdi case 
        span = bdi_thresh, for JBOD case
        x_intercept = setpoint + span;

It does make sense to squeeze the bdi_dirty fluctuation range a bit by
doubling span and making the control line sharper.
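
A minimal sketch of the revised computation, assuming all quantities
are page counts and write_bw is in pages per second (so 8 * write_bw is
roughly what gets written back in 8 seconds). The helper name and the
single_bdi flag are hypothetical and only serve to pick one of the two
cases; how the real code distinguishes or blends them is not shown here.

```c
#include <stdbool.h>

/*
 * Hypothetical helper illustrating the revised x_intercept.
 * single_bdi is an assumed flag for this sketch only.
 */
unsigned long bdi_x_intercept(unsigned long setpoint,
			      unsigned long write_bw,
			      unsigned long bdi_thresh,
			      bool single_bdi)
{
	unsigned long span = single_bdi ? 8 * write_bw : bdi_thresh;

	return setpoint + span;	/* previously setpoint + 2 * span */
}
```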

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-06 12:40                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-09-06 12:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-09-02 at 14:16 +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > > 
> > > Ok so this argument makes sense, is there some formalism to describe
> > > such systems where such things are more evident?
> > 
> > I find the most easy and clean way to describe it is,
> > 
> > (1) the below formula
> >                                                           write_bw  
> >     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                           dirty_bw
> > is able to yield
> > 
> >     dirty_ratelimit_(i) ~= (write_bw / N)
> > 
> > as long as
> > 
> > - write_bw, dirty_bw and pos_ratio are not changing rapidly
> > - dirty pages are not around @freerun or @limit
> > 
> > Otherwise there will be larger estimation errors.
> > 
> > (2) based on (1), we get
> > 
> >     task_ratelimit ~= (write_bw / N) * pos_ratio
> > 
> > So the pos_ratio feedback is able to drive dirty count to the
> > setpoint, where pos_ratio = 1.
> > 
> > That interpretation based on _real values_ can neatly decouple the two
> > feedback loops :) It makes full utilization of the fact "the
> > dirty_ratelimit _value_ is independent on pos_ratio except for
> > possible impacts on estimation errors". 
> 
> OK, so the 'problem' I have with this is that the whole control thing
> really doesn't care about N. All it does is measure:
> 
>  - dirty rate
>  - writeback rate
> 
> observe:
> 
>  - dirty count; with the independent input of its setpoint
> 
> control:
> 
>  - ratelimit
> 
> so I was looking for a way to describe the interaction between the two
> feedback loops without involving the exact details of what they're
> controlling, but that might just end up being an oxymoron.


Hmm, so per Vivek's argument the system without pos_ratio in the
feedback term isn't convergent. Therefore we should be able to argue
from convergence/stability grounds that this term is indeed needed.

Does the stability proof of a control system need the model of what it's
controlling? I guess I ought to go get a book on this or so.
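
The convergence claim can at least be checked numerically on a toy
model. The sketch below is not the kernel code: it assumes N identical
tasks, a constant write_bw, one update per 200ms period, and a made-up
linear pos_ratio (1 at the setpoint, clamped to [0, 2]). The names
toy_state, toy_pos_ratio and simulate are hypothetical. It starts from
the state in Vivek's example (dirty_ratelimit = 100MB/s, N = 2,
pos_ratio = 0.5) and iterates the update with and without the pos_ratio
term.

```c
/* Toy model only; units are MB and MB/s. */
struct toy_state {
	double dirty_ratelimit;	/* per-bdi base rate */
	double dirty;		/* amount of dirty data */
};

/* Assumed linear control line: 1.0 at setpoint, 0.0 at setpoint + span. */
double toy_pos_ratio(double dirty, double setpoint, double span)
{
	double r = 1.0 + (setpoint - dirty) / span;

	if (r < 0.0)
		r = 0.0;
	if (r > 2.0)
		r = 2.0;
	return r;
}

/* Run `periods` steps of 200ms; use_pos_ratio selects the update rule. */
struct toy_state simulate(int periods, int use_pos_ratio)
{
	const double write_bw = 100.0, setpoint = 200.0, span = 100.0;
	const double dt = 0.2;
	const int nr_tasks = 2;
	struct toy_state s = { 100.0, 250.0 };	/* pos_ratio starts at 0.5 */
	int i;

	for (i = 0; i < periods; i++) {
		double pos_ratio = toy_pos_ratio(s.dirty, setpoint, span);
		double task_ratelimit = s.dirty_ratelimit * pos_ratio;
		double dirty_bw = nr_tasks * task_ratelimit;

		if (dirty_bw <= 0.0)
			break;	/* fully throttled; the ratio below is undefined */

		/* dirty data grows by what is dirtied minus what is written */
		s.dirty += (dirty_bw - write_bw) * dt;

		/* rate feedback, with or without the pos_ratio term */
		s.dirty_ratelimit *= write_bw / dirty_bw;
		if (use_pos_ratio)
			s.dirty_ratelimit *= pos_ratio;
	}
	return s;
}
```

In this toy model the update without pos_ratio never leaves the
starting state: dirty_ratelimit stays at 100MB/s and the dirty count
stays above the setpoint. The update with pos_ratio settles at
write_bw/N = 50MB/s and the dirty count returns to the setpoint. That
is a numerical illustration under the stated assumptions, not a
stability proof.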




^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-02 12:16                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-09-02 12:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > 
> > Ok so this argument makes sense, is there some formalism to describe
> > such systems where such things are more evident?
> 
> I find the most easy and clean way to describe it is,
> 
> (1) the below formula
>                                                           write_bw  
>     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                           dirty_bw
> is able to yield
> 
>     dirty_ratelimit_(i) ~= (write_bw / N)
> 
> as long as
> 
> - write_bw, dirty_bw and pos_ratio are not changing rapidly
> - dirty pages are not around @freerun or @limit
> 
> Otherwise there will be larger estimation errors.
> 
> (2) based on (1), we get
> 
>     task_ratelimit ~= (write_bw / N) * pos_ratio
> 
> So the pos_ratio feedback is able to drive dirty count to the
> setpoint, where pos_ratio = 1.
> 
> That interpretation based on _real values_ can neatly decouple the two
> feedback loops :) It makes full utilization of the fact "the
> dirty_ratelimit _value_ is independent on pos_ratio except for
> possible impacts on estimation errors". 

OK, so the 'problem' I have with this is that the whole control thing
really doesn't care about N. All it does is measure:

 - dirty rate
 - writeback rate

observe:

 - dirty count; with the independent input of its setpoint

control:

 - ratelimit

so I was looking for a way to describe the interaction between the two
feedback loops without involving the exact details of what they're
controlling, but that might just end up being an oxymoron.



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:12                               ` Peter Zijlstra
@ 2011-08-29 13:37                                 ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-29 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 29, 2011 at 09:12:07PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> > 
> > Ok, I think I am beginning to see your point. Let me just elaborate on
> > the example you gave.
> > 
> > Assume a system is completely balanced and a task is writing at 100MB/s
> > rate.
> > 
> > write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> > 
> > bdi->dirty_ratelimit = 100MB/s
> > 
> > Now another tasks starts dirtying the page cache on same bdi. Number of 
> > dirty pages should go up pretty fast and likely position ratio feedback
> > will kick in to reduce the dirtying rate. (rate based feedback does not
> > kick in till next 200ms) and pos_ratio feedback seems to be instantaneous.
> > Assume new pos_ratio is .5
> > 
> > So new throttle rate for both the tasks is 50MB/s.
> > 
> > bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> > task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> > 
> > Now lets say 200ms have passed and rate base feedback is reevaluated.
> > 
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> >                                                       dirty_bw
> > 
> > bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> > 
> > Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but 
> > that did not happen. And reason being that there are two feedback control
> > loops and pos_ratio loops reacts to imbalances much more quickly. Because
> > previous loop has already reacted to the imbalance and reduced the
> > dirtying rate of task, rate based loop does not try to adjust anything
> > and thinks everything is just fine.
> > 
> > Things are fine in the sense that still dirty_rate == write_bw but
> > system is not balanced in terms of number of dirty pages and pos_ratio=.5
> > 
> > So you are trying to make one feedback loop aware of second loop so that
> > if second loop is unbalanced, first loop reacts to that as well and not
> > just look at dirty_rate and write_bw. So refining new balanced rate by
> > pos_ratio helps.
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                       dirty_bw
> > 
> > Now if global dirty pages are imbalanced, balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to
> > further reduction in task dirty rate. Which in turn will lead to reduced
> > number of dirty rate and should eventually lead to pos_ratio=1.
> 
> 
> Ok so this argument makes sense, is there some formalism to describe
> such systems where such things are more evident?

I find the easiest and cleanest way to describe it is,

(1) the below formula
                                                          write_bw  
    bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
                                                          dirty_bw
is able to yield

    dirty_ratelimit_(i) ~= (write_bw / N)

as long as

- write_bw, dirty_bw and pos_ratio are not changing rapidly
- dirty pages are not around @freerun or @limit

Otherwise there will be larger estimation errors.

(2) based on (1), we get

    task_ratelimit ~= (write_bw / N) * pos_ratio

So the pos_ratio feedback is able to drive dirty count to the
setpoint, where pos_ratio = 1.

That interpretation based on _real values_ can neatly decouple the two
feedback loops :) It makes full use of the fact that "the
dirty_ratelimit _value_ is independent of pos_ratio except for
possible impacts on estimation errors".
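
One way to see why the value does not depend on pos_ratio, as a sketch
under the simplifying assumption that all N tasks were throttled at
dirty_ratelimit_i * pos_ratio during the last period, so that the
measured dirty_bw is N times that:

```latex
\begin{align*}
\text{dirty\_bw} &= N \cdot \text{dirty\_ratelimit}_i \cdot \text{pos\_ratio} \\
\text{dirty\_ratelimit}_{i+1}
  &= \text{dirty\_ratelimit}_i \cdot
     \frac{\text{write\_bw}}{\text{dirty\_bw}} \cdot \text{pos\_ratio} \\
  &= \text{dirty\_ratelimit}_i \cdot
     \frac{\text{write\_bw}}{N \cdot \text{dirty\_ratelimit}_i \cdot \text{pos\_ratio}}
     \cdot \text{pos\_ratio} \\
  &= \frac{\text{write\_bw}}{N}
\end{align*}
```

pos_ratio cancels out, so under that assumption it can only influence
the result through errors in the measured dirty_bw and write_bw.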

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-29 13:12                               ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-29 13:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.
> 
> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another tasks starts dirtying the page cache on same bdi. Number of 
> dirty pages should go up pretty fast and likely position ratio feedback
> will kick in to reduce the dirtying rate. (rate based feedback does not
> kick in till next 200ms) and pos_ratio feedback seems to be instantaneous.
> Assume new pos_ratio is .5
> 
> So new throttle rate for both the tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now lets say 200ms have passed and rate base feedback is reevaluated.
> 
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
>                                                       dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should now have become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops, and the pos_ratio loop reacts to imbalances much more quickly. Because
> the pos_ratio loop has already reacted to the imbalance and reduced the
> dirtying rate of the task, the rate-based loop does not try to adjust
> anything and thinks everything is just fine.
> 
> Things are fine in the sense that dirty_rate == write_bw still holds, but
> the system is not balanced in terms of the number of dirty pages, and
> pos_ratio=.5
> 
> So you are trying to make one feedback loop aware of the second loop, so
> that if the second loop is unbalanced, the first loop reacts to that as well
> and does not just look at dirty_rate and write_bw. So refining the new
> balanced rate by pos_ratio helps.
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                       dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to a
> further reduction in the task dirty rate, which in turn will lead to a
> reduced number of dirty pages and should eventually lead to pos_ratio=1.


OK, so this argument makes sense. Is there some formalism to describe
such systems in which such things are more evident?
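
The one-step arithmetic in the quoted example can be checked with a small
userspace sketch. It is not the kernel code: the function names and the
1024-based fixed-point scale for pos_ratio are assumptions made here for
illustration, and rates are plain MB/s integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: pos_ratio == 1.0 is stored as 1024. */
#define POS_RATIO_ONE 1024

/* Pure rate feedback: ratelimit * write_bw / dirty_bw. */
static uint64_t next_ratelimit_pure(uint64_t ratelimit, uint64_t write_bw,
				    uint64_t dirty_bw)
{
	assert(dirty_bw > 0);
	return ratelimit * write_bw / dirty_bw;
}

/* Rate feedback refined by pos_ratio, as in the second quoted formula. */
static uint64_t next_ratelimit_pos(uint64_t ratelimit, uint64_t write_bw,
				   uint64_t dirty_bw, uint64_t pos_ratio)
{
	assert(dirty_bw > 0);
	return ratelimit * write_bw / dirty_bw * pos_ratio / POS_RATIO_ONE;
}
```

With the example's numbers (ratelimit 100, write_bw 100, dirty_bw 100,
pos_ratio 512/1024), the pure form returns 100 and the pos_ratio-refined
form returns 50, which matches the N=2 target in the quoted text.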



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:18                                             ` Peter Zijlstra
@ 2011-08-26 13:24                                               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 09:18:21PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> > We got similar results as in the read disturber case, even though one
> > disturbs N and the other impacts writeout bandwidth.  The original
> > patchset is consistently performing much better :)
> 
> It does indeed, and I figure on these timescales it makes sense to
> assume N is a constant. Fair enough, thanks!

Thank you! Glad that we finally reached some consensus :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:13                                         ` Wu Fengguang
@ 2011-08-26 13:18                                             ` Peter Zijlstra
  0 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-26 13:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> We got similar results as in the read disturber case, even though one
> disturbs N and the other impacts writeout bandwidth.  The original
> patchset is consistently performing much better :)

It does indeed, and I figure on these timescales it makes sense to
assume N is a constant. Fair enough, thanks!

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:20                                         ` Wu Fengguang
  (?)
@ 2011-08-26 13:13                                         ` Wu Fengguang
  2011-08-26 13:18                                             ` Peter Zijlstra
  -1 siblings, 1 reply; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

On Fri, Aug 26, 2011 at 08:20:57PM +0800, Wu Fengguang wrote:
> On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > > a "disturber" dd read task during roughly 120-130s. 
> > 
> > Ah, but ideally the disturber task should run in bursts of 100ms
> > (<feedback period), otherwise your N is indeed mostly constant.
> 
> Ah yeah, the disturber task should be a dd writer! Then we get
> 
> - 120s: N=1 => N=2
> - 130s: N=2 => N=1

Here they are. The write disturber starts/stops around 150s.

We got similar results as in the read disturber case, even though one
disturbs N and the other impacts writeout bandwidth.  The original
patchset is consistently performing much better :)

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 120914 bytes --]

[-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --]
[-- Type: image/png, Size: 142966 bytes --]

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:11                                       ` Peter Zijlstra
@ 2011-08-26 12:20                                         ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > a "disturber" dd read task during roughly 120-130s. 
> 
> Ah, but ideally the disturber task should run in bursts of 100ms
> (<feedback period), otherwise your N is indeed mostly constant.

Ah yeah, the disturber task should be a dd writer! Then we get

- 120s: N=1 => N=2
- 130s: N=2 => N=1

I'll try it right away.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 11:26                                   ` Wu Fengguang
@ 2011-08-26 12:11                                       ` Peter Zijlstra
  0 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-26 12:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> a "disturber" dd read task during roughly 120-130s. 

Ah, but ideally the disturber task should run in bursts of 100ms
(<feedback period), otherwise your N is indeed mostly constant.



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
  (?)
  (?)
@ 2011-08-26 11:26                                   ` Wu Fengguang
  2011-08-26 12:11                                       ` Peter Zijlstra
  -1 siblings, 1 reply; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

Peter,

Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
a "disturber" dd read task during roughly 120-130s.

(1) balance_dirty_pages-pages.png

This is the output of the original patchset. Here the "balanced
ratelimit" dots are mostly accurate except when near @freerun or @limit.

(2) balance_dirty_pages-pages_pure-rate-feedback.png

do this change:
  -	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
  +	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
   					   dirty_rate | 1);

Here the "balanced ratelimit" dots go in the opposite direction
compared to "pos ratelimit", which is the expected result discussed
in the other email. Then the system gets stuck in an unbalanced dirty
position.  It moves slowly towards the setpoint thanks to the
dirty_ratelimit update policy: it only updates dirty_ratelimit when
balanced_dirty_ratelimit fluctuates to the same side as
task_ratelimit, which introduces some systematic "errors" in the
right direction ;)

(3) balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png

Further remove the "do conservative bdi->dirty_ratelimit updates"
feature by replacing its update policy with a direct assignment:

        bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);

This is to check whether dirty_ratelimit can still go back to the balance
point without the help of the dirty_ratelimit update policy. To my
surprise, dirty_ratelimit jumps to a HUGE singular value and shows no
sign of coming back to normal..

In summary, the original patchset shows the best behavior :)
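
The "same side" update policy mentioned under (2) can be sketched roughly
as follows. This is only an illustration of the idea as described in this
mail, not the patch itself: the function name is made up, and jumping
straight to the nearer of the two targets is a simplifying assumption (the
real code may take smaller, damped steps).

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical filter: change dirty_ratelimit only when task_ratelimit
 * and balanced_ratelimit both lie on the same side of it.
 */
static unsigned long update_ratelimit(unsigned long dirty_ratelimit,
				      unsigned long task_ratelimit,
				      unsigned long balanced_ratelimit)
{
	assert(dirty_ratelimit > 0);

	/* Both signals say "raise": move up to the nearer one. */
	if (task_ratelimit > dirty_ratelimit &&
	    balanced_ratelimit > dirty_ratelimit)
		return min_ul(task_ratelimit, balanced_ratelimit);

	/* Both signals say "lower": move down to the nearer one, but not below 1. */
	if (task_ratelimit < dirty_ratelimit &&
	    balanced_ratelimit < dirty_ratelimit)
		return max_ul(max_ul(task_ratelimit, balanced_ratelimit), 1UL);

	/* The signals disagree: hold the current value. */
	return dirty_ratelimit;
}
```

Under this sketch, a balanced_ratelimit that points away from
task_ratelimit leaves dirty_ratelimit unchanged, so only fluctuations that
happen to agree with the position feedback get through.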

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 75688 bytes --]

[-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --]
[-- Type: image/png, Size: 83327 bytes --]

[-- Attachment #4: balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png --]
[-- Type: image/png, Size: 63923 bytes --]

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:42                                     ` Peter Zijlstra
@ 2011-08-26 10:52                                       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:42:22PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> > Sorry I'm now feeling lost...
> 
> hehe welcome to my world ;-)

Yeah, so sorry...

> Seriously though, I appreciate all the effort you put in trying to
> explain things. I feel I do understand things now, although I might not
> completely agree with them quite yet ;-)

Thank you :)

> I'll go read the v9 patch-set you sent out and look at some of the
> details (such as pos_ratio being comprised of both global and bdi
> limits, which so far has been somewhat glossed over).

Hold on please! I'll immediately post a v10 with all the comment updates.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
@ 2011-08-26 10:42                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-26 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> Sorry I'm now feeling lost...

hehe welcome to my world ;-)

Seriously though, I appreciate all the effort you put in trying to
explain things. I feel I do understand things now, although I might not
completely agree with them quite yet ;-)

I'll go read the v9 patch-set you sent out and look at some of the
details (such as pos_ratio being comprised of both global and bdi
limits, which so far has been somewhat glossed over).



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  9:04                                 ` Peter Zijlstra
@ 2011-08-26 10:04                                   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 05:04:29PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> > On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> 
> > > > Put (6) into (4), we get
> > > > 
> > > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > > >                             = (write_bw / N) * 2
> > > > 
> > > > That means, any position imbalance will lead to balanced_rate
> > > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > > always get the right balanced dirty ratelimit value whether or not
> > > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > > > dirty position control.
> > > > 
> > > > (*) independent as in real values, not the seemingly relations in equation
> > > 
> > > 
> > > The assumption here is that N is a constant.. in the above case
> > > pos_ratio would eventually end up at 1 and things would be good again. I
> > > see your argument about oscillations, but I think you can introduce
> > > similar effects by varying N.
> > 
> > Yeah, it's very possible for N to change over time, in which case
> > balanced_rate will adapt to new N in similar way.
> 
> Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
> you accept that for pos_ratio but you don't mind for N ?

Sorry, I'm now feeling lost... Anyway, it's convenient to try out the
pure rate feedback, and the test case includes exactly such a sudden
change of N.

I'm now running the tests with this trivial patch:

--- linux-next.orig/mm/page-writeback.c	2011-08-26 17:58:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 17:59:06.000000000 +0800
@@ -800,7 +800,7 @@ static void bdi_update_dirty_ratelimit(s
 	 * the dirty count meet the setpoint, but also where the slope of
 	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
 	 */
-	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
 					   dirty_rate | 1);
 
 	/*
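
The `dirty_rate | 1` divisor in the hunk above keeps the division safe when
no dirtying was measured in the period. A minimal userspace sketch of that
expression follows; the `div_u64()` here is a local stand-in written for
this example (the kernel provides its own), and `balanced_ratelimit()` is a
hypothetical wrapper name.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel helper of the same name. */
static uint64_t div_u64(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* Hypothetical wrapper around the expression in the diff. */
static uint64_t balanced_ratelimit(unsigned long task_ratelimit,
				   unsigned long write_bw,
				   unsigned long dirty_rate)
{
	/* Setting the low bit makes the divisor odd, hence never zero. */
	return div_u64((uint64_t)task_ratelimit * write_bw,
		       (uint32_t)(dirty_rate | 1));
}
```

The cost of the guard is a small bias: an even dirty_rate is bumped by one,
so task_ratelimit 50, write_bw 100 and dirty_rate 100 yield 5000 / 101 = 49
rather than 50.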

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  8:56                                     ` Peter Zijlstra
@ 2011-08-26  9:53                                       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 04:56:11PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
> >         /*
> >          * A linear estimation of the "balanced" throttle rate. The theory is,
> >          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
> >          * dirty_rate will be measured to be (N * task_ratelimit). So the below
> >          * formula will yield the balanced rate limit (write_bw / N).
> >          *
> >          * Note that the expanded form is not a pure rate feedback:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
> >          * but also takes pos_ratio into account:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
> >          *
> >          * (1) is not realistic because pos_ratio also takes part in balancing
> >          * the dirty rate.  Consider the state
> >          *      pos_ratio = 0.5                                              (3)
> >          *      rate = 2 * (write_bw / N)                                    (4)
> >          * If (1) is used, it will stuck in that state! Because each dd will be
> >          * throttled at
> >          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
> >          * yielding
> >          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
> >          * put (6) into (1) we get
> >          *      rate_(i+1) = rate_(i)                                        (7)
> >          *
> >          * So we end up using (2) to always keep
> >          *      rate_(i+1) ~= (write_bw / N)                                 (8)
> >          * regardless of the value of pos_ratio. As long as (8) is satisfied,
> >          * pos_ratio is able to drive itself to 1.0, which is not only where
> >          * the dirty count meet the setpoint, but also where the slope of
> >          * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
> >          */ 
> 
> I'm still not buying this, it has the massive assumption N is a
> constant, without that assumption you get the same kind of thing you get
> from not adding pos_ratio to the feedback term.

The reasoning in (3)-(7) actually assumes both N and write_bw to be
constant. It documents a stuck state..

> Also, I've yet to see what harm it does if you leave it out, all
> feedback loops should stabilize just fine.

That's a good question. It should be trivial to try out equation (1)
and see how it works out in practice. Let me collect some figures..
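
Before the figures arrive, the stuck state in (3)-(7) and the claim in (8)
can be checked numerically under the comment's own idealised model
(constant N and write_bw, dirty_rate exactly N * task_ratelimit). The
helper names and the 1024-based pos_ratio scale below are assumptions for
this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: pos_ratio == 1.0 is stored as 1024. */
#define POS_RATIO_ONE 1024

/* (5): each task is throttled at pos_ratio * rate. */
static uint64_t task_ratelimit(uint64_t rate, uint64_t pos_ratio)
{
	return rate * pos_ratio / POS_RATIO_ONE;
}

/* (1): pure rate feedback, scaling the base rate. */
static uint64_t step_eq1(uint64_t rate, uint64_t write_bw,
			 uint64_t n_tasks, uint64_t pos_ratio)
{
	/* (6): idealised measurement, N tasks each at task_ratelimit. */
	uint64_t dirty_rate = n_tasks * task_ratelimit(rate, pos_ratio);

	assert(dirty_rate > 0);
	return rate * write_bw / dirty_rate;
}

/* (2): scale task_ratelimit, so pos_ratio enters the feedback. */
static uint64_t step_eq2(uint64_t rate, uint64_t write_bw,
			 uint64_t n_tasks, uint64_t pos_ratio)
{
	uint64_t task_rate = task_ratelimit(rate, pos_ratio);
	uint64_t dirty_rate = n_tasks * task_rate;

	assert(dirty_rate > 0);
	return task_rate * write_bw / dirty_rate;
}
```

Taking write_bw = 102400, N = 4, rate = 2 * write_bw / N = 51200 and
pos_ratio = 512 (0.5), step_eq1() returns 51200 again, the stuck state of
(7), while step_eq2() returns 25600 = write_bw / N. In this model
step_eq2() gives the same 25600 for pos_ratio 256 or 1024 as well, which is
what (8) states.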

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  0:18                               ` Wu Fengguang
@ 2011-08-26  9:04                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-26  9:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:

> > > Put (6) into (4), we get
> > > 
> > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > >                             = (write_bw / N) * 2
> > > 
> > > That means, any position imbalance will lead to balanced_rate
> > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > always get the right balanced dirty ratelimit value whether or not
> > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > > dirty position control.
> > > 
> > > (*) independent as in real values, not the seemingly relations in equation
> > 
> > 
> > The assumption here is that N is a constant.. in the above case
> > pos_ratio would eventually end up at 1 and things would be good again. I
> > see your argument about oscillations, but I think you can introduce
> > similar effects by varying N.
> 
> Yeah, it's very possible for N to change over time, in which case
> balanced_rate will adapt to new N in similar way.

Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
you accept that for pos_ratio when you don't mind it for N?



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  1:56                                   ` Wu Fengguang
@ 2011-08-26  8:56                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-26  8:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
>         /*
>          * A linear estimation of the "balanced" throttle rate. The theory is,
>          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
>          * dirty_rate will be measured to be (N * task_ratelimit). So the below
>          * formula will yield the balanced rate limit (write_bw / N).
>          *
>          * Note that the expanded form is not a pure rate feedback:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
>          * but also takes pos_ratio into account:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
>          *
>          * (1) is not realistic because pos_ratio also takes part in balancing
>          * the dirty rate.  Consider the state
>          *      pos_ratio = 0.5                                              (3)
>          *      rate = 2 * (write_bw / N)                                    (4)
>          * If (1) is used, it gets stuck in that state! Because each dd will be
>          * throttled at
>          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
>          * yielding
>          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
>          * put (6) into (1) we get
>          *      rate_(i+1) = rate_(i)                                        (7)
>          *
>          * So we end up using (2) to always keep
>          *      rate_(i+1) ~= (write_bw / N)                                 (8)
>          * regardless of the value of pos_ratio. As long as (8) is satisfied,
>          * pos_ratio is able to drive itself to 1.0, which is not only where
>          * the dirty count meets the setpoint, but also where the slope of
>          * pos_ratio is flattest and hence task_ratelimit fluctuates the least.
>          */ 

I'm still not buying this. It makes the massive assumption that N is a
constant; without that assumption you get the same kind of thing you get
from not adding pos_ratio to the feedback term.

Also, I've yet to see what harm it does if you leave it out; all
feedback loops should stabilize just fine.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25 22:20                                 ` Vivek Goyal
@ 2011-08-26  1:56                                   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > So you are trying to make one feedback loop aware of second loop so that
> > > if second loop is unbalanced, first loop reacts to that as well and not
> > > just look at dirty_rate and write_bw. So refining new balanced rate by
> > > pos_ratio helps.
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * (write_bw / dirty_bw) * pos_ratio
> > > 
> > > Now if global dirty pages are imbalanced, balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to
> > > further reduction in task dirty rate. Which in turn will lead to reduced
> > > number of dirty rate and should eventually lead to pos_ratio=1.
> > 
> > Right, that's a good alternative viewpoint to the below one.
> > 
> >   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_bw)
> > 
> > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
> 
> Personally I found it much easier to understand the other representation.
> Once you have come up with equation.
> 
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw
> 
> Can you please put few lines of comments to explain that why above
> alone is not sufficient and we need to take pos_ratio also in to
> account to keep number of dirty pages in check. And then go onto
> 
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio
> 
> This kind of maintains the continuity of explanation and explains
> that why are we deviating from the theory we discussed so far.

Good point. Here is the commented code:

        /*
         * task_ratelimit reflects each dd's dirty rate for the past 200ms.
         */
        task_ratelimit = (u64)dirty_ratelimit *
                                        pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * A linear estimation of the "balanced" throttle rate. The theory is,
         * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
         * dirty_rate will be measured to be (N * task_ratelimit). So the below
         * formula will yield the balanced rate limit (write_bw / N).
         *
         * Note that the expanded form is not a pure rate feedback:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
         * but also takes pos_ratio into account:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
         *
         * (1) is not realistic because pos_ratio also takes part in balancing
         * the dirty rate.  Consider the state
         *      pos_ratio = 0.5                                              (3)
         *      rate = 2 * (write_bw / N)                                    (4)
         * If (1) is used, it gets stuck in that state! Because each dd will be
         * throttled at
         *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
         * yielding
         *      dirty_rate = N * task_ratelimit = write_bw                   (6)
         * put (6) into (1) we get
         *      rate_(i+1) = rate_(i)                                        (7)
         *
         * So we end up using (2) to always keep
         *      rate_(i+1) ~= (write_bw / N)                                 (8)
         * regardless of the value of pos_ratio. As long as (8) is satisfied,
         * pos_ratio is able to drive itself to 1.0, which is not only where
         * the dirty count meets the setpoint, but also where the slope of
         * pos_ratio is flattest and hence task_ratelimit fluctuates the least.
         */
        balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
                                           dirty_rate | 1);
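
For anyone who wants to poke at the arithmetic outside the kernel, here is a
minimal userspace sketch of the two statements above. It is not the kernel
code: RATELIMIT_CALC_SHIFT is assumed to be 10 (so 1024 stands for
pos_ratio = 1.0), div_u64() is replaced by a plain 64-bit division,
dirty_rate is modelled as N * task_ratelimit as in the comment instead of
being measured, and estimate_balanced_ratelimit() is a made-up name.

```c
#include <stdint.h>

/* Assumption: fixed-point scale, 1 << 10 represents pos_ratio == 1.0. */
#define RATELIMIT_CALC_SHIFT	10

/*
 * One estimation step. Rates are in pages/s, pos_ratio is fixed-point.
 * dirty_rate is not measured here; it is modelled as N * task_ratelimit.
 */
static uint64_t estimate_balanced_ratelimit(uint64_t dirty_ratelimit,
					    uint64_t pos_ratio,
					    uint64_t write_bw,
					    uint64_t nr_tasks)
{
	uint64_t task_ratelimit;
	uint64_t dirty_rate;

	/* "*" binds tighter than ">>": multiply first, then scale down. */
	task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT;
	dirty_rate = nr_tasks * task_ratelimit;

	/* "| 1" keeps the divisor non-zero when no pages were dirtied. */
	return task_ratelimit * write_bw / (dirty_rate | 1);
}
```

With write_bw = 25600 and N = 4, both the stuck state (dirty_ratelimit =
12800, pos_ratio = 512) and the same over-estimated rate at pos_ratio = 1024
come out within one page/s of write_bw / N = 6400. The small shortfall comes
from the "| 1" and integer rounding.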

> > 
> > > A related question, though I should have asked you this long back: how does
> > > throttling based on rate help? Why could we not just work with two
> > > pos_ratios, one a global position ratio and the other a bdi position ratio,
> > > and then throttle tasks gradually to achieve smooth throttling behavior?
> > > IOW, what property does rate provide which is not available just by
> > > looking at per-bdi dirty pages? Can't we come up with a bdi setpoint and
> > > limit the way you have done for the global setpoint and throttle tasks
> > > accordingly?
> > 
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> > 
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
> 
> Ok, that makes sense. By keeping an estimation of rate at which bdi
> can write, our range of throttling goes down. Say 0 to 300MB/s instead
> of 0 to 1TB/sec and that can lead to a more smooth behavior.

Yeah exactly, and even better, we can make the slope much flatter
around the setpoint to achieve excellent smoothness in the stable state :)
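
A back-of-the-envelope sketch of why the narrower range matters. It assumes
a straight control line falling from max_rate at @freerun to 0 at @limit
(the patch itself uses a polynomial), and the figures are invented: a
100000-page window between @freerun and @limit and a 1000-page wobble in the
dirty count. ratelimit_swing() is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Change in task_ratelimit (KB/s) caused by jitter_pages of dirty-count
 * noise on a linear control line that falls from max_rate at @freerun to
 * 0 at @limit, span_pages apart.
 */
static uint64_t ratelimit_swing(uint64_t max_rate, uint64_t span_pages,
				uint64_t jitter_pages)
{
	return max_rate * jitter_pages / span_pages;
}
```

ratelimit_swing(1000000000, 100000, 1000) is 10000000 KB/s (10 GB/s) for the
1TB/s line, against ratelimit_swing(300000, 100000, 1000) = 3000 KB/s
(3 MB/s) for the 300MB/s one.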

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 16:12                             ` Peter Zijlstra
@ 2011-08-26  0:18                               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-26  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of confusions, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly the seemingly use of pos_ratio that
> > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each tasks at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent as in real values, not the seemingly relations in equation
> 
> 
> The assumption here is that N is a constant.. in the above case
> pos_ratio would eventually end up at 1 and things would be good again. I
> see your argument about oscillations, but I think you can introduce
> similar effects by varying N.

Yeah, it's very possible for N to change over time, in which case
balanced_rate will adapt to the new N in a similar way.
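
One period of both update rules can be worked through numerically. This is
only a sketch with made-up numbers: next_rate() and its parameters are
hypothetical names, and dirty_rate is modelled as N * pos_ratio * rate
rather than measured, with N assumed to hold for the whole period.

```c
/*
 * One update period. rate is balanced_rate_(i); the return value is
 * balanced_rate_(i+1). with_pos_ratio selects (5), otherwise (4).
 */
static double next_rate(double rate, double pos_ratio, double write_bw,
			double nr_tasks, int with_pos_ratio)
{
	/* Each of the N tasks was throttled at pos_ratio * rate. */
	double dirty_rate = nr_tasks * pos_ratio * rate;
	double next = rate * (write_bw / dirty_rate);

	return with_pos_ratio ? next * pos_ratio : next;
}
```

Take write_bw = 100. With N = 4, rate = 25 and pos_ratio = 0.5, (4) returns
50 while (5) returns 25. If N instead goes from 4 to 8 with pos_ratio = 1.0,
both rules return 12.5, the new write_bw / N, after a single period.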

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25  3:19                               ` Wu Fengguang
@ 2011-08-25 22:20                                 ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-25 22:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:

[..]
> > So you are trying to make one feedback loop aware of the second loop so that
> > if the second loop is unbalanced, the first loop reacts to that as well and does not
> > just look at dirty_rate and write_bw. So refining the new balanced rate by
> > pos_ratio helps.
> >                                                       write_bw
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                       dirty_bw
> > 
> > Now if global dirty pages are imbalanced, the balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to a
> > further reduction in the task dirty rate, which in turn will lead to a reduced
> > number of dirty pages and should eventually lead to pos_ratio=1.
> 
> Right, that's a good alternative viewpoint to the one below.
> 
>                                                   write_bw
>   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
>                                                   dirty_bw
> 
> (1) the periodic rate estimation uses that to refresh the balanced rate every 200ms
> (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0

Personally I found the other representation much easier to understand.
Once you have come up with the equation

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw

can you please add a few lines of comments explaining why the above
alone is not sufficient and why we also need to take pos_ratio into
account to keep the number of dirty pages in check? And then go on to

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw * pos_ratio

This maintains the continuity of the explanation and explains
why we are deviating from the theory we discussed so far.

> 
> > A related question, though I should have asked you this long back: how does
> > throttling based on rate help? Why could we not just work with two
> > pos_ratios, one the global position ratio and the other the bdi position ratio,
> > and then throttle the task gradually to achieve smooth throttling behavior?
> > IOW, what property does rate provide which is not available just by
> > looking at per bdi dirty pages? Can't we come up with a bdi setpoint and
> > limit the way you have done for the global setpoint and throttle tasks
> > accordingly?
> 
> Good question. If we have no idea of the balanced rate at all, but
> still want to limit dirty pages within the range [freerun, limit],
> all we can do is to throttle the task at eg. 1TB/s at @freerun and
> 0 at @limit. Then you get a really sharp control line which will make
> task_ratelimit fluctuate like mad...
> 
> So the balanced rate estimation is the key to get smooth task_ratelimit,
> while pos_ratio is the ultimate guarantee for the dirty pages range.

Ok, that makes sense. By keeping an estimate of the rate at which the bdi
can write, our range of throttling narrows: say 0 to 300MB/s instead
of 0 to 1TB/s, which can lead to smoother behavior.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 15:57                         ` Peter Zijlstra
@ 2011-08-25  5:30                           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-25  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 11:57:39PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > > >   well, in this concept: the balanced_rate formula inherently does not
> > > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > > >   based on the ratelimit executed for the past 200ms:
> > > > 
> > > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > > 
> > > Ok, this is where it all goes funny..
> > > 
> > > So if you want completely separated feedback loops I would expect
> > 
> > If we call them feedback loops, then it's a series of independent feedback
> > loops of depth 1, because each balanced_rate is a fresh estimation
> > dependent solely on
> > 
> > - writeout bandwidth
> > - N, the number of dd tasks
> > 
> > in the past 200ms.
> > 
> > As long as a CONSTANT ratelimit (whatever value it is) is executed in
> > the past 200ms, we can get the same balanced_rate.
> > 
> >         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> > 
> > The resulting balanced_rate is independent of how large the CONSTANT
> > ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> > we'll see a doubled dirty_rate and end up with the same balanced_rate.
> > 
> > In that manner, balance_rate_(i+1) does not really depend on the
> > value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> > to get the same balance_rate_(i+1).
> 
> At best this argument says it doesn't matter what we use, making
> balance_rate_i an equally valid choice. However I don't buy this, your
> argument is broken, your CONSTANT_ratelimit breaks feedback but then you
> rely on the iterative form of feedback to finish your argument.
> 
> Consider:
> 
> 	r_(i+1) = r_i * ratio_i
> 
> you say, r_i := C for all i, then by definition ratio_i must be 1 and
> you've got nothing. The only way your conclusion can be right is by
> allowing the proper iteration, otherwise we'll never reach the
> equilibrium.
> 
> Now it is true you can introduce random perturbations in r_i at any
> given point and still end up in equilibrium, such is the power of
> iterative feedback, but that doesn't say you can do away with r_i. 

Sure, there is always an r_i.

Sorry, what I mean by CONSTANT_ratelimit is that it remains CONSTANT _inside_
each 200ms period. There will be a series of different CONSTANT values, one
per 200ms period, each roughly (r_i * pos_ratio_i).

> > > something like:
> > > 
> > > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > > 
> > > The former is a complete feedback loop, expressing the new value in the
> > > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > > causing the balance_rate to drop increasing the dirty_rate, and vice
> > > versa.
> > 
> > In principle, the bw_ratio works that way. However since
> > balance_rate_(i) is not the exact _executed_ ratelimit in
> > balance_dirty_pages().
> 
> This seems to be where your argument goes bad, the actually executed
> ratelimit is not important, the variance introduced by pos_ratio is
> purely for the benefit of the dirty page count. 
> 
> It doesn't matter for the balance_rate. Without pos_ratio, the dirty
> page count would stay stable (ignoring all these oscillations and other
> fun things), and therefore it is the balance_rate we should be using for
> the iterative feedback.

Nope. The dirty page count can always stay stable somewhere (but not
necessarily at the setpoint) purely by the pos_ratio feedback, as illustrated
by Vivek's example.

But that's not the balanced state we want. Although the pos_ratio
feedback all by itself is strong enough to keep (dirty_rate == write_bw),
the ideal state is to achieve pos_ratio=1 and eliminate its feedback
error as much as possible, so as to get a smooth task_ratelimit.

We may take this viewpoint: a "successful" balance_rate should help
keep pos_ratio around 1.0 in the long term.

> > > (*) which is the form I expected and why I thought your primary feedback
> > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> >  
> > Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> No, because iterative feedback has the form: 
> 
> 	new = old $op $feedback-term
> 

The problem is, the pos_ratio feedback will jump in and prematurely make
$feedback-term = 1, thus rendering the pure rate feedback weak/useless.

> > > Then when you use the balance_rate to actually throttle tasks you apply
> > > your secondary control steering the dirty page count, yielding:
> > > 
> > > 	task_rate = balance_rate * pos_ratio
> > 
> > Right. Note the above formula is not a derived one, 
> 
> Agreed, it's not a derived expression but the originator of the dirty
> page count control.
> 
> > but an original
> > one that later leads to pos_ratio showing up in the calculation of
> > balanced_rate.
> 
> That's where I disagree :-)
> 
> > > >   and task_ratelimit_200ms happens to be estimable from
> > > > 
> > > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > > 
> > > >   We may alternatively record every task_ratelimit executed in the
> > > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > > >   way we take the "superfluous" pos_ratio out of sight :) 
> > > 
> > > Right, so I'm not at all sure that makes sense, it's not immediately
> > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > > all. 
> > 
> > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> > by balance_dirty_pages(). So this is an original formula:
> > 
> >         task_ratelimit = balance_rate * pos_ratio
> > 
> > task_ratelimit_200ms is also used as an original data source in
> > 
> >         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> But that's exactly where you conflate the positional feedback with the
> throughput feedback, the effective ratelimit includes the positional
> feedback so that the dirty page count can move around, but that is
> completely orthogonal to the throughput feedback since the throughput
> thing would leave the dirty count constant (ideal case again).
> 
> That is, yes the iterative feedback still works because you still got
> your primary feedback in place, but the addition of pos_ratio in the
> feedback loop is a pure perturbation and doesn't matter one whit.

The problem is that pure rate feedback is not possible because
pos_ratio also takes part in altering the task rate...

> > Then we try to estimate task_ratelimit_200ms by assuming all tasks
> > have been executing the same CONSTANT ratelimit in
> > balance_dirty_pages(). Hence we get
> > 
> >         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio
> 
> But this just cannot be true (and, as argued above, is completely
> unnecessary). 
> 
> Consider the case where the dirty count is way below the setpoint but
> the base ratelimit is pretty accurate. In that case we would start out
> by creating very low task ratelimits such that the dirty count can

s/low/high/

> increase. Once we match the setpoint we go back to the base ratelimit.
> The average over those 200ms would be <1, but since we're right at the
> setpoint when we do the base ratelimit feedback we pick exactly 1. 

Yeah, that's the kind of error introduced by the CONSTANT ratelimit
assumption, which could be pretty large on small-memory boxes. Given that
pos_ratio will fluctuate more anyway when memory, and hence the
dirty control scope, is small, such rate estimation errors are tolerable.
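
To put a number on that kind of error, here is a small Python sketch under
made-up assumptions: two tasks, an accurate base ratelimit, and a pos_ratio
that sits at 1.5 for the first half of the period and at 1.0 (the setpoint)
for the second half. It compares the estimate built from the pos_ratio sampled
at update time with one built from the average of the executed ratelimits. The
names and values are hypothetical, not taken from the patch.

```python
def estimate(task_ratelimit_200ms, write_bw, dirty_rate):
    """balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate."""
    return task_ratelimit_200ms * write_bw / dirty_rate


write_bw, n_tasks = 100.0, 2         # MB/s and task count, hypothetical
base = write_bw / n_tasks            # accurate base ratelimit: 50 MB/s
pos_ratios = [1.5, 1.5, 1.0, 1.0]    # four 50ms slices, ending at the setpoint

executed = [base * p for p in pos_ratios]
avg_executed = sum(executed) / len(executed)   # true average task ratelimit
dirty_rate = n_tasks * avg_executed            # what would be measured on the bdi

# CONSTANT assumption: use only the pos_ratio seen at update time
sampled = estimate(base * pos_ratios[-1], write_bw, dirty_rate)
# Alternative: use the recorded average of the executed ratelimits
averaged = estimate(avg_executed, write_bw, dirty_rate)
print(sampled, averaged)
```

With these numbers the sampled form underestimates the balanced rate, while
the averaged form recovers write_bw / N.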

> Anyway, it's completely irrelevant.. :-)

Yeah, that's one step further to discuss all kinds of possible errors
on top of the basic theory :)

> > > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > > >   200ms, then it get the balanced rate from the dirty_rate feedback.
> > > 
> > > How can there not be a relation between balance_rate_(i+1) and
> > > balance_rate_(i) ? 
> > 
> > In this manner: even though balance_rate_(i) is somehow used for
> > calculating balance_rate_(i+1), the latter will evaluate to the same
> > value given whatever balance_rate_(i).
> 
> But only if you allow for the iterative feedback to work, you absolutely
> need that balance_rate_(i), without that it's completely broken.

Agreed.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-25  3:19                               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 02:00:58AM +0800, Vivek Goyal wrote:
> On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of confusions, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > > Believe it or not, it's exactly the seemingly superfluous use of pos_ratio that
> > > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > > balance_dirty_pages() will be rate limiting each task at half the
> > > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > > That means any position imbalance will lead to balanced_rate
> > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > always get the right balanced dirty ratelimit value whether or not
> > > (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> > > dirty position control.
> > > 
> > > (*) independent in the resulting values, not in how the equation appears
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.

Thank you very much :)

> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. The number of
> dirty pages should go up pretty fast and the position ratio feedback will likely
> kick in to reduce the dirtying rate (rate based feedback does not
> kick in until the next 200ms, while pos_ratio feedback seems to be instantaneous).

That's right. There must be some instantaneous feedback to react to
fast workload changes. With pos_ratio providing this capability, the
estimated balanced rate can take time to follow.

Note that pos_ratio by itself is enough to limit dirty pages within
the [freerun, limit] control scope. The cost of a (temporarily) large
error in the balanced rate is that task_ratelimit will fluctuate much
more: pos_ratio will depart from 1.0 (to the point where it can fully
compensate for the rate error) and dirty pages will approach
@freerun or @limit, where the slope of pos_ratio gets steep.

The correct estimation of the balanced rate serves to drive pos_ratio back
to 1.0, where its slope is flattest.

> Assume new pos_ratio is .5
> 
> So new throttle rate for both the tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and the rate based feedback is reevaluated.
> 
>                                                       write_bw
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
>                                                       dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops and the pos_ratio loop reacts to imbalances much more quickly. Because
> the pos_ratio loop has already reacted to the imbalance and reduced the
> dirtying rate of the tasks, the rate based loop does not try to adjust anything
> and thinks everything is just fine.

That's right.

> Things are fine in the sense that still dirty_rate == write_bw but
> system is not balanced in terms of number of dirty pages and pos_ratio=.5

Yes. The bad thing is, if the above equation (of pure rate feedback)
is used, the system is going to remain in that position-imbalanced
state forever, which is bad for the smoothness of task_ratelimit.

> So you are trying to make one feedback loop aware of second loop so that
> if second loop is unbalanced, first loop reacts to that as well and not
> just look at dirty_rate and write_bw. So refining new balanced rate by
> pos_ratio helps.
> 						      write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> 						      dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to a
> further reduction in the task dirty rate, which in turn will lead to a
> reduced number of dirty pages and should eventually lead to pos_ratio=1.

Right, that's a good alternative viewpoint to the one below.

  						  write_bw	
  bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
  						  dirty_bw

(1) the periodic rate estimation uses that to refresh the balanced rate every 200ms
(2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
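
Points (1) and (2) can be tried out in a toy closed-loop model. The
sketch below is only an illustration under stated assumptions: the
linear pos_ratio line, the setpoint, the span and the 2-task workload
are all made up for the example and are not taken from the patches. It
runs Vivek's scenario (stale ratelimit of 100MB/s, N=2, pos_ratio
starting at .5) once with the pure rate feedback and once with the
task_ratelimit form.

```python
# Toy closed-loop model; every constant and the linear pos_ratio line are
# illustrative assumptions, not values taken from the patches.
WRITE_BW = 100.0   # MB/s, disk writeout bandwidth
N = 2              # number of dd tasks
SETPOINT = 1000.0  # MB of dirty pages at which pos_ratio == 1.0
SPAN = 200.0       # MB of drift that moves pos_ratio by 1.0
PERIOD = 0.2       # seconds between rate estimations


def pos_ratio(dirty):
    """Linear stand-in for the control line, clamped to [0, 2]."""
    return min(2.0, max(0.0, 1.0 + (SETPOINT - dirty) / SPAN))


def simulate(use_task_ratelimit, periods=100):
    """Return (ratelimit, pos_ratio) after the given number of periods."""
    ratelimit = 100.0          # stale estimate from when N was 1
    dirty = SETPOINT + 100.0   # chosen so that pos_ratio starts at 0.5
    for _ in range(periods):
        pr = pos_ratio(dirty)
        task_ratelimit = ratelimit * pr    # what each task is throttled at
        dirty_bw = N * task_ratelimit      # measured bdi dirty rate
        dirty += (dirty_bw - WRITE_BW) * PERIOD
        if dirty_bw <= 0.0:
            continue                       # nothing measured, keep estimate
        base = task_ratelimit if use_task_ratelimit else ratelimit
        ratelimit = base * WRITE_BW / dirty_bw
    return ratelimit, pos_ratio(dirty)


pure_rate = simulate(use_task_ratelimit=False)
with_task_rate = simulate(use_task_ratelimit=True)
print("pure rate feedback:  ratelimit=%.1f pos_ratio=%.2f" % pure_rate)
print("task_ratelimit form: ratelimit=%.1f pos_ratio=%.2f" % with_task_rate)
```

In this model the pure rate feedback never leaves the stale ratelimit
and pos_ratio stays at .5, while the task_ratelimit form settles at
write_bw / N and lets pos_ratio drift back toward 1.0.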

> A related question, though I should have asked you this long back: how does
> throttling based on rate help? Why could we not just work with two
> pos_ratios, one a global position ratio and the other a bdi position ratio,
> and then throttle tasks gradually to achieve smooth throttling behavior?
> IOW, what property does rate provide which is not available just by
> looking at per-bdi dirty pages? Can't we come up with a bdi setpoint and
> limit the way you have done for the global setpoint and throttle tasks
> accordingly?

Good question. If we have no idea of the balanced rate at all, but
still want to limit dirty pages within the range [freerun, limit],
all we can do is to throttle the task at e.g. 1TB/s at @freerun and
0 at @limit. Then you get a really steep control line which will make
task_ratelimit fluctuate like mad...

So the balanced rate estimation is the key to getting a smooth
task_ratelimit, while pos_ratio is the ultimate guarantee for the dirty
pages range.
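
To put rough numbers on "fluctuate like mad", here is a small sketch
comparing the two control lines. It assumes straight lines between
@freerun and @limit (the series itself uses a 3rd order polynomial for
the global control line), and the 1000MB/2000MB thresholds and the
50MB/s balanced rate are made-up values for illustration.

```python
# Assumed numbers for illustration only.
FREERUN = 1000.0   # MB of dirty pages where throttling starts
LIMIT = 2000.0     # MB of dirty pages where tasks must stop


def position_only_rate(dirty, max_rate=1024.0 * 1024.0):
    """No rate estimate: fall linearly from max_rate (1TB/s in MB/s) to 0."""
    return max_rate * (LIMIT - dirty) / (LIMIT - FREERUN)


def rate_plus_position(dirty, balanced_rate=50.0):
    """Balanced rate known: scale it by a pos_ratio falling from 2 to 0."""
    pos_ratio = 2.0 * (LIMIT - dirty) / (LIMIT - FREERUN)
    return balanced_rate * pos_ratio


def swing(rate_fn, drift=10.0):
    """Change in task_ratelimit when dirty pages drift by `drift` MB."""
    mid = (FREERUN + LIMIT) / 2
    return abs(rate_fn(mid + drift) - rate_fn(mid))


print("position only:      %.2f MB/s per 10MB drift" % swing(position_only_rate))
print("with balanced rate: %.2f MB/s per 10MB drift" % swing(rate_plus_position))
```

With these assumed numbers the same 10MB drift in dirty pages moves the
position-only ratelimit by thousands of MB/s, but moves the
balanced-rate-based ratelimit by about 1MB/s.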

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 18:00                             ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-24 18:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why following will not work.
> > 
> > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of confusions, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly the seemingly superfluous use of pos_ratio
> that makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent in terms of real values, despite the apparent relation in the equation

Ok, I think I am beginning to see your point. Let me just elaborate on
the example you gave.

Assume a system is completely balanced and a task is writing at 100MB/s
rate.

write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1

bdi->dirty_ratelimit = 100MB/s

Now another task starts dirtying the page cache on the same bdi. The number
of dirty pages should go up pretty fast, and the position ratio feedback
will likely kick in to reduce the dirtying rate. (Rate-based feedback does
not kick in until the next 200ms, whereas pos_ratio feedback seems to be
instantaneous.) Assume the new pos_ratio is .5

So the new throttle rate for both tasks is 50MB/s.

bdi->dirty_ratelimit = 100MB/s (rate feedback has not kicked in yet)
task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s

Now let's say 200ms have passed and the rate-based feedback is re-evaluated.

						      write_bw	
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
						      dirty_bw

bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s

Ideally bdi->dirty_ratelimit should now have become 50MB/s since N=2, but
that did not happen. The reason is that there are two feedback control
loops, and the pos_ratio loop reacts to imbalances much more quickly.
Because that loop has already reacted to the imbalance and reduced the
dirtying rate of each task, the rate-based loop does not try to adjust
anything and thinks everything is just fine.

Things are fine in the sense that dirty_rate == write_bw still holds, but
the system is not balanced in terms of the number of dirty pages, and
pos_ratio=.5
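
The arithmetic above generalizes. Assuming every task is throttled at
dirty_ratelimit * pos_ratio for the whole 200ms, so that the measured
dirty_bw is N * dirty_ratelimit * pos_ratio, the pure rate update
always lands on (write_bw / N) / pos_ratio, whatever the old ratelimit
was. The helper below is a hypothetical name used only for this sketch.

```python
def pure_rate_update(dirty_ratelimit, pos_ratio, n_tasks, write_bw=100.0):
    """One 200ms step of dirty_ratelimit * write_bw / dirty_bw.

    Assumes each task ran at dirty_ratelimit * pos_ratio for the whole
    period, so dirty_bw = n_tasks * dirty_ratelimit * pos_ratio.
    """
    dirty_bw = n_tasks * dirty_ratelimit * pos_ratio
    return dirty_ratelimit * write_bw / dirty_bw


# The example above: stale ratelimit of 100MB/s, N=2, pos_ratio=.5
print(pure_rate_update(100.0, 0.5, 2))

# Same pos_ratio, but starting from the already correct 50MB/s
print(pure_rate_update(50.0, 0.5, 2))

# Only with pos_ratio == 1.0 does the update give write_bw / N
print(pure_rate_update(50.0, 1.0, 2))
```

The first two calls both return 100MB/s instead of the ideal 50MB/s;
the second one is the (write_bw / N) * 2 result quoted earlier in this
mail.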

So you are trying to make one feedback loop aware of second loop so that
if second loop is unbalanced, first loop reacts to that as well and not
just look at dirty_rate and write_bw. So refining new balanced rate by
pos_ratio helps.
						      write_bw	
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
						      dirty_bw

Now if global dirty pages are imbalanced, the balanced rate will still go
down despite the fact that dirty_bw == write_bw. This will lead to a
further reduction in the task dirty rate, which in turn will lead to a
reduced number of dirty pages and should eventually lead to pos_ratio=1.

A related question, though I should have asked you this long back: how does
throttling based on rate help? Why could we not just work with two
pos_ratios, one a global position ratio and the other a bdi position ratio,
and then throttle tasks gradually to achieve smooth throttling behavior?
IOW, what property does rate provide which is not available just by
looking at per-bdi dirty pages? Can't we come up with a bdi setpoint and
limit the way you have done for the global setpoint and throttle tasks
accordingly?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 16:12                             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-24 16:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why following will not work.
> > 
> > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of confusions, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly the seemingly superfluous use of pos_ratio
> that makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent in terms of real values, despite the apparent relation in the equation


The assumption here is that N is a constant. In the above case
pos_ratio would eventually end up at 1 and things would be good again. I
see your argument about oscillations, but I think you can introduce
similar effects by varying N.
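
One way to read the varying-N remark, sketched under the assumption
that equation (3) holds exactly within each 200ms period: the estimate
a period starts with was computed for the previous N, so every change
in N leaves a one-period error of N_now / N_prev. The sequence of N
values and the helper name are made up for the example.

```python
WRITE_BW = 100.0  # MB/s, assumed constant


def estimate_lag(n_per_period):
    """Pair each period's ideal rate with the estimate inherited from the
    previous period, which was computed for the previous N."""
    rows = []
    for n_prev, n_now in zip(n_per_period, n_per_period[1:]):
        inherited = WRITE_BW / n_prev   # equation (3) for the old N
        ideal = WRITE_BW / n_now        # what the new N would need
        rows.append((n_now, inherited, ideal, inherited / ideal))
    return rows


for n_now, inherited, ideal, error in estimate_lag([1, 2, 4, 4, 1]):
    print("N=%d inherited=%.0f ideal=%.0f error=%.2fx"
          % (n_now, inherited, ideal, error))
```

Whenever the error factor differs from 1, something other than the rate
estimate has to absorb the difference for that period, which in this
scheme would be pos_ratio.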

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
  (?)
@ 2011-08-24 15:57                         ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-24 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If we call them feedback loops, then it's a series of independent feedback
> loops of depth 1, because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulting balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see a doubled dirty_rate, which results in the same balanced_rate.
> 
> In that manner, balance_rate_(i+1) is not really depending on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) 

At best this argument says it doesn't matter what we use, making
balance_rate_i an equally valid choice. However, I don't buy this. Your
argument is broken: your CONSTANT_ratelimit breaks feedback, but then you
rely on the iterative form of feedback to finish your argument.

Consider:

	r_(i+1) = r_i * ratio_i

you say, r_i := C for all i, then by definition ratio_i must be 1 and
you've got nothing. The only way your conclusion can be right is by
allowing the proper iteration, otherwise we'll never reach the
equilibrium.

Now it is true you can introduce random perturbations in r_i at any
given point and still end up in equilibrium, such is the power of
iterative feedback, but that doesn't say you can do away with r_i. 
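
A minimal sketch of that iteration, assuming an idealized system with
no pos_ratio where the measured dirty_rate is exactly N * r_i and
ratio_i = write_bw / dirty_rate. The constants and the perturbation
hook are made up for the example.

```python
WRITE_BW = 100.0  # MB/s
N = 2             # tasks, so the equilibrium is WRITE_BW / N


def iterate(r0, steps, perturb=None):
    """Run r_(i+1) = r_i * ratio_i with ratio_i = write_bw / dirty_rate.

    `perturb` maps a step index to a value that overwrites r_i at that
    step, modelling a random disturbance of the ratelimit.
    """
    perturb = perturb or {}
    r = r0
    history = []
    for i in range(steps):
        r = perturb.get(i, r)
        history.append(r)
        dirty_rate = N * r             # no pos_ratio in this model
        r = r * WRITE_BW / dirty_rate  # iterative update
    return history


print(iterate(100.0, 5, perturb={2: 400.0}))
```

Under these assumptions the sequence reaches write_bw / N after one
step and returns to it one step after the disturbance at step 2. Real
measurements are noisier, so the one-step convergence here is a
property of the toy model and not a claim about the kernel.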

> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However since
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().

This seems to be where your argument goes bad: the actually executed
ratelimit is not important; the variance introduced by pos_ratio is
purely for the benefit of the dirty page count.

It doesn't matter for the balance_rate. Without pos_ratio, the dirty
page count would stay stable (ignoring all these oscillations and other
fun things), and therefore it is the balance_rate we should be using for
the iterative feedback.

> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.

No, because iterative feedback has the form: 

	new = old $op $feedback-term


> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, 

Agreed, it's not a derived expression but the originator of the dirty
page count control.

> but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.

That's where I disagree :-)

> > >   and task_ratelimit_200ms happen to can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, its not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

But that's exactly where you conflate the positional feedback with the
throughput feedback. The effective ratelimit includes the positional
feedback so that the dirty page count can move around, but that is
completely orthogonal to the throughput feedback, since the throughput
feedback alone would leave the dirty count constant (ideal case again).

That is, yes the iterative feedback still works because you still got
your primary feedback in place, but the addition of pos_ratio in the
feedback loop is a pure perturbation and doesn't matter one whit.

> Then we try to estimate task_ratelimit_200ms by assuming all tasks
> have been executing the same CONSTANT ratelimit in
> balance_dirty_pages(). Hence we get
> 
>         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

But this just cannot be true (and, as argued above, is completely
unnecessary). 

Consider the case where the dirty count is way below the setpoint but
the base ratelimit is pretty accurate. In that case we would start out
by creating very low task ratelimits such that the dirty count can
increase. Once we match the setpoint we go back to the base ratelimit.
The average over those 200ms would be <1, but since we're right at the
setpoint when we do the base ratelimit feedback we pick exactly 1. 

Anyway, it's completely irrelevant.. :-)

> > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > >   200ms, then it get the balanced rate from the dirty_rate feedback.
> > 
> > How can there not be a relation between balance_rate_(i+1) and
> > balance_rate_(i) ? 
> 
> In this manner: even though balance_rate_(i) is somehow used for
> calculating balance_rate_(i+1), the latter will evaluate to the same
> value given whatever balance_rate_(i).

But only if you allow for the iterative feedback to work. You absolutely
need that balance_rate_(i); without that it's completely broken.


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 17:47                         ` Vivek Goyal
@ 2011-08-24  0:12                           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-24  0:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> You somehow directly jump to  
> 
> 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> without explaining why following will not work.
> 
> 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks for asking that. It's probably the root of the confusion, so let
me answer it standalone.

It's actually pretty simple to explain this equation:

                                               write_bw
        balanced_rate = task_ratelimit_200ms * ----------       (1)
                                               dirty_rate

If there are N dd tasks and each task has been throttled at
task_ratelimit_200ms for the past 200ms, we are going to measure the
overall bdi dirty rate

        dirty_rate = N * task_ratelimit_200ms                   (2)

Putting (2) into (1), we get

        balanced_rate = write_bw / N                            (3)

So equation (1) is the right estimation to get the desired target (3).
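
A small numeric check of (1)-(3), as a sketch rather than the kernel
code. It assumes the idealized model stated above (N identical tasks, so
dirty_rate = N * task_ratelimit_200ms); the function name
`balanced_rate` and the sample numbers are made up for illustration.

```python
def balanced_rate(task_ratelimit_200ms, write_bw, n_tasks):
    """Estimate the balanced rate with equation (1)."""
    # Equation (2): measured bdi dirty rate under the assumed model.
    dirty_rate = n_tasks * task_ratelimit_200ms
    # Equation (1): scale the executed ratelimit by write_bw / dirty_rate.
    return task_ratelimit_200ms * write_bw / dirty_rate


# Whatever ratelimit was executed, the estimate lands on write_bw / N.
for executed in (1.0, 10.0, 1000.0):
    print(executed, balanced_rate(executed, write_bw=80.0, n_tasks=4))
```

In this model the executed ratelimit cancels out, which is the content of
equation (3).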


As for

                                                  write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
                                                  dirty_rate

Let's compare it with the "expanded" form of (1):

                                                              write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
                                                              dirty_rate

So the difference lies in pos_ratio.

Believe it or not, it's exactly the seemingly superfluous use of
pos_ratio that makes (5) independent(*) of the position control.

Why? Look at (4), assume the system is in a state

- dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
- dirty position is not balanced, for example pos_ratio = 0.5

balance_dirty_pages() will be rate limiting each task at half the
balanced dirty rate, yielding a measured

        dirty_rate = write_bw / 2                               (6)

Put (6) into (4), we get

        balanced_rate_(i+1) = balanced_rate_(i) * 2
                            = (write_bw / N) * 2

That means any position imbalance will lead to balanced_rate
estimation errors if we follow (4). Whereas if (1)/(5) is used, we
always get the right balanced dirty ratelimit value whether or not
(pos_ratio == 1.0), hence making the rate estimation independent(*) of
dirty position control.

(*) independent in terms of the actual values, despite the apparent
    relation in the equation
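
The comparison between (4) and (5) can be sketched numerically. This is
an illustration under the same idealized assumption as above (N
identical tasks, each executing balanced_rate_(i) * pos_ratio), not the
kernel implementation; the function names and sample numbers are
hypothetical.

```python
def measured_dirty_rate(balanced_rate_i, pos_ratio, n_tasks):
    """Dirty rate seen when every task runs at balanced_rate_i * pos_ratio."""
    return n_tasks * balanced_rate_i * pos_ratio


def next_rate_eq4(balanced_rate_i, pos_ratio, write_bw, n_tasks):
    """Update rule (4): no pos_ratio in the feedback formula."""
    dirty_rate = measured_dirty_rate(balanced_rate_i, pos_ratio, n_tasks)
    return balanced_rate_i * write_bw / dirty_rate


def next_rate_eq5(balanced_rate_i, pos_ratio, write_bw, n_tasks):
    """Update rule (5): pos_ratio included in the feedback formula."""
    dirty_rate = measured_dirty_rate(balanced_rate_i, pos_ratio, n_tasks)
    return balanced_rate_i * pos_ratio * write_bw / dirty_rate


write_bw, n_tasks = 80.0, 4
balanced = write_bw / n_tasks          # rate already balanced
pos_ratio = 0.5                        # position not balanced
print(next_rate_eq4(balanced, pos_ratio, write_bw, n_tasks))
print(next_rate_eq5(balanced, pos_ratio, write_bw, n_tasks))
```

With these inputs the measured dirty_rate is write_bw / 2, matching (6):
rule (4) doubles the estimate, while rule (5) stays at write_bw / N.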

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
@ 2011-08-23 17:47                         ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-23 17:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 10:15:04PM +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If call it feedback loops, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulted balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see doubled dirty_rate and result in the same balanced_rate. 
> 
> In that manner, balance_rate_(i+1) is not really depending on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) if not considering estimation
> errors. Note that the estimation errors mainly come from the
> fluctuations in dirty_rate.
> 
> That may well be what's already in your mind, just that we disagree
> about the terms ;)
> 
> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However since
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().
> 
> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> > With the above balance_rate is an independent variable that tracks the
> > write bandwidth. Now possibly you'd want a low-pass filter on that since
> > your bw_ratio is a bit funny in the head, but that's another story.
> 
> Yeah.
> 
> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.
> 
> > >   and task_ratelimit_200ms happen to can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, its not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 

I think the above works out to:

 task_ratelimit = balanced_rate * pos_ratio
or
 task_ratelimit = task_ratelimit_200ms * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = balance_rate * pos_ratio  * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = balance_rate * write_bw / dirty_rate * pos_ratio^2

And the question is, why not:

 task_ratelimit = prev-balance_rate * write_bw / dirty_rate * pos_ratio

which sounds more intuitive compared to the former one.

You somehow directly jump to  

	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

without explaining why the following will not work.

	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:36                       ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-23 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 12:01:00PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..

Exactly. This is where it gets confusing, and it is the bone of contention.

> 
> So if you want completely separated feedback loops I would expect
> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 

I agree. This makes sense. IOW:

bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_(n-1) * write_bw / dirty_rate

> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

I think you meant:

"if we throttled too much, the dirty_rate will have dropped and the bw_ratio
 will be >1 causing the balance_rate to increase hence increasing the
 dirty_rate, and vice versa."
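
The direction of the correction can be illustrated with a small sketch. The function names and numbers below are hypothetical and only assume bw_ratio = write_bw / dirty_rate, as used throughout this thread.

```python
def bw_ratio(write_bw, dirty_rate):
    """Ratio of writeout bandwidth to the observed dirtying rate."""
    return write_bw / dirty_rate


def next_balance_rate(balance_rate, write_bw, dirty_rate):
    """One update of the form balance_rate_(i+1) = balance_rate_(i) * bw_ratio."""
    return balance_rate * bw_ratio(write_bw, dirty_rate)


if __name__ == "__main__":
    # Throttled too much: tasks dirty slower than the disk can write.
    print(bw_ratio(100.0, 50.0), next_balance_rate(10.0, 100.0, 50.0))
    # Throttled too little: tasks dirty faster than the disk can write.
    print(bw_ratio(100.0, 200.0), next_balance_rate(10.0, 100.0, 200.0))
```

With dirty_rate below write_bw the ratio is above 1 and the rate is raised; with dirty_rate above write_bw the ratio is below 1 and the rate is lowered.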

> 
> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> 
> With the above balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.
> 
> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio
> 
> >   and task_ratelimit_200ms happen to can be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense, its not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 
> 

Well, I thought this much was evident:

task_ratelimit = balanced_rate * pos_ratio

What is not evident to me is the following:

balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Instead, like you, I also thought that the following is more obvious:

balanced_rate_(i+1) = balanced_rate_(i) * bw_ratio

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:15                       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-23 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..
> 
> So if you want completely separated feedback loops I would expect

If we call them feedback loops, then it's a series of independent
feedback loops of depth 1, because each balanced_rate is a fresh
estimate that depends solely on

- writeout bandwidth
- N, the number of dd tasks

in the past 200ms.

As long as a CONSTANT ratelimit (whatever value it is) is executed in
the past 200ms, we can get the same balanced_rate.

        balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate

The resulting balanced_rate is independent of how large the CONSTANT
ratelimit is, because if we start with a doubled CONSTANT ratelimit,
we'll see a doubled dirty_rate and end up with the same balanced_rate.
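
A minimal sketch of this cancellation, under the idealized assumption that each of the N tasks dirties pages at exactly the executed ratelimit (so dirty_rate = N * ratelimit). The function names are hypothetical and the numbers are illustrative.

```python
def observed_dirty_rate(n_tasks, executed_ratelimit):
    # Assumption: every task dirties pages exactly at the executed ratelimit.
    return n_tasks * executed_ratelimit


def balanced_rate(executed_ratelimit, write_bw, n_tasks):
    """balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate."""
    dirty_rate = observed_dirty_rate(n_tasks, executed_ratelimit)
    return executed_ratelimit * write_bw / dirty_rate


if __name__ == "__main__":
    # Doubling the executed ratelimit doubles dirty_rate, so the estimate
    # stays at write_bw / N.
    print(balanced_rate(10.0, 100.0, 4))
    print(balanced_rate(20.0, 100.0, 4))
```

In this model the estimate depends only on write_bw and N, not on the ratelimit that happened to be executed.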

In that manner, balanced_rate_(i+1) does not really depend on the
value of balanced_rate_(i): whatever balanced_rate_(i) is, we are going
to get the same balanced_rate_(i+1), estimation errors aside. Note
that the estimation errors mainly come from the fluctuations in
dirty_rate.

That may well be what's already in your mind, just that we disagree
about the terms ;)

> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 
> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

In principle, the bw_ratio works that way. However, balance_rate_(i)
is not the exact _executed_ ratelimit in balance_dirty_pages().

> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
 
Because the executed ratelimit was rate_(i) * pos_ratio.

> With the above balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.

Yeah.

> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio

Right. Note the above formula is not a derived one, but an original
one that later leads to pos_ratio showing up in the calculation of
balanced_rate.

> >   and task_ratelimit_200ms happen to can be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense, its not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 

task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
by balance_dirty_pages(). So this is an original formula:

        task_ratelimit = balance_rate * pos_ratio

task_ratelimit_200ms is also used as an original data source in

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

Then we try to estimate task_ratelimit_200ms by assuming all tasks
have been executing the same CONSTANT ratelimit in
balance_dirty_pages(). Hence we get

        task_ratelimit_200ms ~= prev_balance_rate * pos_ratio
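
To make the role of pos_ratio concrete, here is a sketch comparing the two update forms discussed in this thread. It assumes an idealized model in which dirty_rate = N * executed ratelimit, with no estimation noise and no filtering; the function name and numbers are hypothetical.

```python
def next_balanced_rate(balance_rate, pos_ratio, write_bw, n_tasks,
                       use_executed_rate):
    """One 200ms update step under an idealized dirty_rate model."""
    # Ratelimit actually executed by balance_dirty_pages().
    executed = balance_rate * pos_ratio
    # Assumption: tasks dirty pages exactly at the executed ratelimit.
    dirty_rate = n_tasks * executed
    # Either scale the executed rate, or scale balance_rate itself.
    base = executed if use_executed_rate else balance_rate
    return base * write_bw / dirty_rate


if __name__ == "__main__":
    # balance_rate=40, pos_ratio=0.5, write_bw=100, N=4 (illustrative)
    print(next_balanced_rate(40.0, 0.5, 100.0, 4, use_executed_rate=True))
    print(next_balanced_rate(40.0, 0.5, 100.0, 4, use_executed_rate=False))
```

In this model, scaling the executed rate gives write_bw / N regardless of pos_ratio, while scaling balance_rate itself gives write_bw / (N * pos_ratio), so the estimate is skewed whenever pos_ratio differs from 1.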

> >   There is fundamentally no dependency between balanced_rate_(i+1) and
> >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> >   200ms, then it get the balanced rate from the dirty_rate feedback.
> 
> How can there not be a relation between balance_rate_(i+1) and
> balance_rate_(i) ? 

In this manner: even though balance_rate_(i) is used for calculating
balance_rate_(i+1), the latter evaluates to the same value whatever
balance_rate_(i) is.

That is, there are two kinds of dependency: the apparent dependency in
the formula, and the effective dependency in the data values.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23  3:40                   ` Wu Fengguang
  (?)
@ 2011-08-23 10:01                     ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-23 10:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> - not a factor at all for updating balanced_rate (whether or not we do (2))
>   well, in this concept: the balanced_rate formula inherently does not
>   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
>   based on the ratelimit executed for the past 200ms:
> 
>           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Ok, this is where it all goes funny..

So if you want completely separated feedback loops I would expect
something like:

	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms

The former is a complete feedback loop, expressing the new value in the
old value (*) with bw_ratio as feedback parameter; if we throttled too
much, the dirty_rate will have dropped and the bw_ratio will be <1
causing the balance_rate to drop increasing the dirty_rate, and vice
versa.

(*) which is the form I expected and why I thought your primary feedback
loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio

With the above, balance_rate is an independent variable that tracks the
write bandwidth. Now possibly you'd want a low-pass filter on that since
your bw_ratio is a bit funny in the head, but that's another story.

Then when you use the balance_rate to actually throttle tasks you apply
your secondary control steering the dirty page count, yielding:

	task_rate = balance_rate * pos_ratio
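
The two-stage structure described above can be sketched as follows. This is only an illustration: the exponential moving average stands in for the unspecified low-pass filter, and the function names and the alpha parameter are assumptions, not anything from the patch.

```python
def filtered_balance_rate(prev_rate, write_bw, dirty_rate, alpha=0.25):
    """Primary loop: move balance_rate toward prev_rate * bw_ratio.

    alpha is a hypothetical smoothing factor; alpha=1.0 disables filtering.
    """
    target = prev_rate * write_bw / dirty_rate
    return prev_rate + alpha * (target - prev_rate)


def task_rate(balance_rate, pos_ratio):
    """Secondary control: steer the dirty page count via pos_ratio."""
    return balance_rate * pos_ratio


if __name__ == "__main__":
    rate = filtered_balance_rate(100.0, write_bw=80.0, dirty_rate=100.0)
    print(rate, task_rate(rate, pos_ratio=0.5))
```

Here pos_ratio only affects the rate handed to tasks and never feeds back into balance_rate directly.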

>   and task_ratelimit_200ms happen to can be estimated from
> 
>           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

>   We may alternatively record every task_ratelimit executed in the
>   past 200ms and average them all to get task_ratelimit_200ms. In this
>   way we take the "superfluous" pos_ratio out of sight :) 

Right, so I'm not at all sure that makes sense, it's not immediately
evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
clear to me why your primary feedback loop uses task_ratelimit_200ms at
all. 

>   There is fundamentally no dependency between balanced_rate_(i+1) and
>   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
>   only asks for _whatever_ CONSTANT task ratelimit to be executed for
>   200ms, then it get the balanced rate from the dirty_rate feedback.

How can there not be a relation between balance_rate_(i+1) and
balance_rate_(i) ? 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-22 15:38                 ` Peter Zijlstra
@ 2011-08-23  3:40                   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 22, 2011 at 11:38:07PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > 
> > To start with,
> > 
> >                                                 write_bw
> >         ref_bw = task_ratelimit_in_past_200ms * --------
> >                                                 dirty_bw
> > 
> > where
> >         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> > 
> > > > Now all of the above would seem to suggest:
> > > > 
> > > >   dirty_ratelimit := ref_bw
> > 
> > Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> > started with exactly the above equation when I got choked by pure
> > pos_bw based feedback control (as mentioned in the reply to Jan's
> > email) and introduced the ref_bw estimation as the way out.
> > 
> > But there are some imperfections in ref_bw, too. Which makes it not
> > suitable for direct use:
> > 
> > 1) large fluctuations
> 
> OK, understood.
> 
> > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> > becomes unbalanced match, which leads to large systematical errors
> > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> > be compensated smoothly.
> 
> OK.
> 
> > 3) since we ultimately want to
> > 
> > - keep the dirty pages around the setpoint as long time as possible
> > - keep the fluctuations of task ratelimit as small as possible
> 
> Fair enough ;-)
> 
> > the update policy used for (2) also serves the above goals nicely:
> > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> > point to bring up dirty_ratelimit in a hurry and to hurt both the
> > above two goals.
> 
> Right, so still I feel somewhat befuddled, so we have:
> 
> 	dirty_ratelimit - rate at which we throttle dirtiers as
> 			  estimated upto 200ms ago.

Note that bdi->dirty_ratelimit is supposed to be the balanced
ratelimit, i.e. (write_bw / N), regardless of whether the dirty pages
meet the setpoint.

In _concept_, the bdi balanced ratelimit is updated _independently_ of
the position control embodied in the task ratelimit calculation.

A lot of the confusion seems to come from the seemingly intertwined rate
and position controls. In my mind, however, there are two levels of
relationship:

1) they work fundamentally independently of each other, each trying to
   fulfill one single target (either balanced rate or balanced position)

2) _based_ on (1), and completely optional, we try to constrain the rate
   update to get a more stable ->dirty_ratelimit and a more balanced
   dirty position

Note that (2) is not a must even if there are systematic errors in the
balanced_rate calculation. For example, the v8 patchset only does (1)
and hence simply does

        bdi->dirty_ratelimit = balanced_rate;

And it can still balance at some point (though not exactly around the setpoint):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/balance_dirty_pages-pages.png

That holds even though ext4 has mismatched rates (dirty_rate:write_bw ~= 3:2),
which introduces systematic errors in balanced_rate:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/global_dirtied_written.png

> 	pos_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in dirty pages around its target

So pos_ratio

- is a _limiting_ factor rather than an _adjusting_ factor for
  updating ->dirty_ratelimit (when doing (2))

- is not a factor at all for updating balanced_rate (whether or not we
  do (2)). Well, in concept: the balanced_rate formula inherently does
  not derive balanced_rate_(i+1) from balanced_rate_i. Rather, it's
  based on the ratelimit executed for the past 200ms:

          balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

  and task_ratelimit_200ms happens to be estimable from

          task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

  There is fundamentally no dependency between balanced_rate_(i+1) and
  balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
  only asks for _whatever_ CONSTANT task ratelimit to be executed for
  200ms, then it gets the balanced rate from the dirty_rate feedback.

  We may alternatively record every task_ratelimit executed in the
  past 200ms and average them all to get task_ratelimit_200ms. In this
  way we take the "superfluous" pos_ratio out of sight :)
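The independence claimed above can be checked with a small user-space
sketch. This is not the kernel code: the names estimate_balanced_rate()
and balanced_rate_for() are hypothetical, and the workload is idealized
(N tasks that each dirty at exactly the ratelimit they were given, so
dirty_rate = N * task_ratelimit). Under those assumptions the estimate
comes out at write_bw / N whatever constant ratelimit was executed.

```c
#include <stdint.h>

/*
 * balanced_rate_(i+1) = task_ratelimit_200ms * write_bw / dirty_rate
 *
 * All rates are in pages per second. Returning 0 when nothing was
 * dirtied in the period is an arbitrary choice made for this sketch.
 */
static uint64_t estimate_balanced_rate(uint64_t task_ratelimit_200ms,
				       uint64_t write_bw,
				       uint64_t dirty_rate)
{
	if (dirty_rate == 0)
		return 0;
	return task_ratelimit_200ms * write_bw / dirty_rate;
}

/*
 * Idealized workload (an assumption): nr_tasks dirtiers, each dirtying
 * at exactly task_ratelimit, so dirty_rate = nr_tasks * task_ratelimit.
 */
static uint64_t balanced_rate_for(uint64_t task_ratelimit,
				  uint64_t write_bw, uint64_t nr_tasks)
{
	uint64_t dirty_rate = nr_tasks * task_ratelimit;

	return estimate_balanced_rate(task_ratelimit, write_bw, dirty_rate);
}
```

With write_bw = 25600 and 8 tasks, executing a ratelimit of 100, 1000
or 5000 for the period gives the same estimate each time, which is why
the previous balanced_rate does not need to enter the formula.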

> 	bw_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in input/output bandwidth
> 
> and we need to basically do:
> 
> 	dirty_ratelimit *= pos_ratio * bw_ratio

So there is no such recursion at all, not even

        balanced_rate *= bw_ratio

Each balanced_rate is estimated from scratch, based on its own 200ms period.

> to update the dirty_ratelimit to reflect the current state. However per
> 1) and 2) bw_ratio is crappy and hard to fix.
> 
> So you propose to update dirty_ratelimit only if both pos_ratio and
> bw_ratio point in the same direction, however that would result in:
> 
>   if (pos_ratio < UNIT && bw_ratio < UNIT ||
>       pos_ratio > UNIT && bw_ratio > UNIT) {
> 	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
> 	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
>   }

We start by doing this for (1):

        dirty_ratelimit = balanced_rate

and then try to refine it for (1)+(2):

        dirty_ratelimit => balanced_rate, but limit the progress by pos_ratio

> > > > However for that you use:
> > > > 
> > > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > > 
> > > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > > >         dirty_ratelimit = min(ref_bw, pos_bw);
> > 
> > The above are merely constraints to the dirty_ratelimit update.
> > It serves to
> > 
> > 1) stop adjusting the rate when it's against the position control
> >    target (the adjusted rate will slow down the progress of dirty
> >    pages going back to setpoint).
> 
> Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
> they point in different directions however:
> 
>  0.5 < 1 &&  0.5 * 1.1 < 1
> 
> so your code will in fact update the dirty_ratelimit, even though the
> two factors point in opposite directions.

It does not work that way since pos_ratio does not take part in the
multiplication. However I admit that the tests

        (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
        (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)

don't aim to avoid all unnecessary updates, and they may even stop some
rightful updates. It's not possible to act perfectly. It's merely
a rule that sounds "reasonable" in theory and works reasonably well in
practice :) I'd be happy to try better ones if there are any.
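For reference, the two tests together with the quoted max()/min()
updates can be wrapped into one function. This is only a sketch of the
rule as quoted in this thread; the helper name constrain_ratelimit() is
made up, and the real patch may differ in detail.

```c
/*
 * Hypothetical helper: apply the two quoted tests to a single update.
 * The rate only moves when pos_bw and ref_bw lie on the same side of
 * the current dirty_ratelimit, and then only as far as the nearer one.
 */
static unsigned long constrain_ratelimit(unsigned long dirty_ratelimit,
					 unsigned long pos_bw,
					 unsigned long ref_bw)
{
	/* both below: step down to the larger (closer) of the two */
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return pos_bw > ref_bw ? pos_bw : ref_bw;

	/* both above: step up to the smaller (closer) of the two */
	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return pos_bw < ref_bw ? pos_bw : ref_bw;

	/* they disagree: leave the rate alone */
	return dirty_ratelimit;
}
```

Starting from dirty_ratelimit = 1000, the pair (pos_bw, ref_bw) =
(800, 1200) leaves the rate untouched, (800, 600) lowers it to 800, and
(1300, 1100) raises it to 1100.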

> > 2) limit the step size. pos_bw is changing values step by step,
> >    leaving a consistent trace comparing to the randomly jumping
> >    ref_bw. pos_bw also has smaller errors in stable state and normally
> >    have larger errors when there are big errors in rate. So it's a
> >    pretty good limiting factor for the step size of dirty_ratelimit.
> 
> OK, so that's the min/max stuff, however it only works because you use
> pos_bw and ref_bw instead of the fully separated factors.

Yes, the min/max stuff is for limiting the step size. The "limiting"
intention can be made clearer if written as

        delta = balanced_rate - base_rate;

        if (delta > pos_rate - base_rate)
            delta = pos_rate - base_rate;

        delta /= 8;
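A self-contained version of that step computation might look like the
sketch below. The snippet above only shows the clamp for an increasing
rate; the decreasing branch and the "return 0 when pos_rate points the
other way" behaviour are assumptions extrapolated from constraint 1)
quoted earlier, and the function name ratelimit_step() is made up.

```c
/*
 * Hypothetical helper: how far to move base_rate towards balanced_rate
 * in one update. The move is clamped by pos_rate and then divided by 8,
 * as in the snippet above. Signed arithmetic keeps both directions
 * symmetric; C integer division truncates towards zero.
 */
static long ratelimit_step(long base_rate, long balanced_rate, long pos_rate)
{
	long delta = balanced_rate - base_rate;
	long limit = pos_rate - base_rate;

	if (delta > 0) {
		if (limit <= 0)		/* position control says "slow down" */
			return 0;
		if (delta > limit)
			delta = limit;
	} else if (delta < 0) {
		if (limit >= 0)		/* position control says "speed up" */
			return 0;
		if (delta < limit)
			delta = limit;
	}

	return delta / 8;
}
```

For base_rate = 1000, balanced_rate = 1800 and pos_rate = 1400 the raw
delta of 800 is clamped to 400 and the step is 50; with pos_rate = 900
the step is 0 because the two controls disagree.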

> > Hope the above elaboration helps :)
> 
> A little.. 

And now? ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 14:20               ` Wu Fengguang
  (?)
@ 2011-08-22 15:38                 ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> To start with,
> 
>                                                 write_bw
>         ref_bw = task_ratelimit_in_past_200ms * --------
>                                                 dirty_bw
> 
> where
>         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> 
> > > Now all of the above would seem to suggest:
> > > 
> > >   dirty_ratelimit := ref_bw
> 
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
> 
> But there are some imperfections in ref_bw, too. Which makes it not
> suitable for direct use:
> 
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> becomes unbalanced match, which leads to large systematical errors
> in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> be compensated smoothly.

OK.

> 3) since we ultimately want to
> 
> - keep the dirty pages around the setpoint as long time as possible
> - keep the fluctuations of task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point to bring up dirty_ratelimit in a hurry and to hurt both the
> above two goals.

Right, so still I feel somewhat befuddled, so we have:

	dirty_ratelimit - rate at which we throttle dirtiers as
			  estimated upto 200ms ago.

	pos_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in dirty pages around its target

	bw_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in input/output bandwidth

and we need to basically do:

	dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However per
1) and 2) bw_ratio is crappy and hard to fix.

So you propose to update dirty_ratelimit only if both pos_ratio and
bw_ratio point in the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > > 
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > 
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> The above are merely constraints to the dirty_ratelimit update.
> It serves to
> 
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to setpoint).

Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
they point in different directions however:

 0.5 < 1 &&  0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.
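The arithmetic can be spelled out in fixed point. The sketch below
assumes UNIT = 1024 and takes pos_bw = dirty_ratelimit * pos_ratio and
ref_bw = pos_bw * bw_ratio, following the definitions quoted at the top
of this mail; the function names are hypothetical.

```c
#define UNIT 1024UL	/* fixed-point 1.0; the value is an assumption */

/* Test on the fully separated factors: do they point the same way? */
static int factors_agree(unsigned long pos_ratio, unsigned long bw_ratio)
{
	return (pos_ratio < UNIT && bw_ratio < UNIT) ||
	       (pos_ratio > UNIT && bw_ratio > UNIT);
}

/* The quoted test on the combined bandwidths pos_bw and ref_bw. */
static int bandwidths_agree(unsigned long dirty_ratelimit,
			    unsigned long pos_ratio, unsigned long bw_ratio)
{
	unsigned long pos_bw = dirty_ratelimit * pos_ratio / UNIT;
	unsigned long ref_bw = pos_bw * bw_ratio / UNIT;

	return (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit) ||
	       (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit);
}
```

With dirty_ratelimit = 1000, pos_ratio = 512 (0.5) and bw_ratio = 1126
(about 1.1), pos_bw is 500 and ref_bw is 549: the bandwidth test passes
although the separated factors disagree.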

> 2) limit the step size. pos_bw is changing values step by step,
>    leaving a consistent trace comparing to the randomly jumping
>    ref_bw. pos_bw also has smaller errors in stable state and normally
>    have larger errors when there are big errors in rate. So it's a
>    pretty good limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff, however it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little.. 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 12:03   ` Jan Kara
@ 2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-17 12:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: David Horner, linux-kernel

On Wed, Aug 17, 2011 at 08:03:56PM +0800, Jan Kara wrote:
> On Wed 17-08-11 02:40:19, David Horner wrote:
> >  I noticed a significant typo below (another of those thousand eyes,
> > thanks to Jan Kara's post that started me looking) :
> > 
> >  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> >  > + unsigned long thresh,
> > ...
> >  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> >  > + * own size, so move the slope over accordingly.
> >  > + */
> >  > + if (unlikely(bdi_thresh > thresh))
> >  > + bdi_thresh = thresh;
> >  > + /*
> >  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> >  > + */
> >  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > 
> >                   ^
> >  I believe should be
> > 
> >     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
>   I've noticed this as well but it's mostly a consistency issue. 'thresh'
> is going to be large in practice so there's not much difference between
> thresh + 1 and thresh | 1.

Right :) Anyway I'll change it to thresh + 1.
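To see how little the two divisors differ, here is a user-space sketch.
Plain 64-bit division stands in for the kernel's div_u64(), and the
helper names ratio_or1() and ratio_plus1() are made up for this
comparison.

```c
#include <stdint.h>

/* bdi_thresh / thresh as a 16-bit fixed-point fraction, "| 1" variant */
static uint64_t ratio_or1(uint64_t bdi_thresh, uint64_t thresh)
{
	return (bdi_thresh << 16) / (thresh | 1);
}

/* same fraction, "+ 1" variant */
static uint64_t ratio_plus1(uint64_t bdi_thresh, uint64_t thresh)
{
	return (bdi_thresh << 16) / (thresh + 1);
}
```

Both variants avoid a division by zero when thresh is 0. For an even
thresh the divisors are identical; for an odd thresh they differ by
one, which only shows in the result when thresh is very small (for
example thresh = 1).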

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17  6:40 ` [PATCH 2/5] writeback: dirty position control David Horner
@ 2011-08-17 12:03   ` Jan Kara
  2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 1 reply; 203+ messages in thread
From: Jan Kara @ 2011-08-17 12:03 UTC (permalink / raw)
  To: David Horner; +Cc: linux-kernel, fengguang.wu, jack

On Wed 17-08-11 02:40:19, David Horner wrote:
>  I noticed a significant typo below (another of those thousand eyes,
> thanks to Jan Kara's post that started me looking) :
> 
>  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  > + unsigned long thresh,
> ...
>  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
>  > + * own size, so move the slope over accordingly.
>  > + */
>  > + if (unlikely(bdi_thresh > thresh))
>  > + bdi_thresh = thresh;
>  > + /*
>  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
>  > + */
>  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> 
>                   ^
>  I believe should be
> 
>     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
  I've noticed this as well but it's mostly a consistency issue. 'thresh'
is going to be large in practice so there's not much difference between
thresh + 1 and thresh | 1.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
       [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
@ 2011-08-17  6:40 ` David Horner
  2011-08-17 12:03   ` Jan Kara
  0 siblings, 1 reply; 203+ messages in thread
From: David Horner @ 2011-08-17  6:40 UTC (permalink / raw)
  To: linux-kernel, fengguang.wu; +Cc: jack

 I noticed a significant typo below (another of those thousand eyes,
thanks to Jan Kara's post that started me looking) :

 > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 > + unsigned long thresh,
...
 > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
 > + * own size, so move the slope over accordingly.
 > + */
 > + if (unlikely(bdi_thresh > thresh))
 > + bdi_thresh = thresh;
 > + /*
 > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
 > + */
 > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);

                  ^
 I believe should be

    x = div_u64((u64)bdi_thresh << 16, thresh + 1);

    David Horner

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  2:08     ` Vivek Goyal
@ 2011-08-16  8:59       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

> > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> > that the resulted task rate limit can drive the dirty pages back to the
> > global/bdi setpoints.
> > 
> 
> IMHO, "position_ratio" is not necessarily very intuitive. Can there be
> a better name? Based on your slides, it is a scaling factor applied to
> task rate limit depending on how well we are doing in terms of meeting
> our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
> that make sense and be a little more intuitive?

Yeah, position_ratio is a scale factor applied to the dirty rate, and I
added a comment saying so. On the other hand, position_ratio does
reflect the underlying "position control of dirty pages" logic, so
over time the name should become reasonably understandable that way :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 21:40             ` Vivek Goyal
@ 2011-08-16  8:55               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry it caused you so much confusion. I hope Peter's 3rd order
polynomial abstraction in v9 clarifies the concepts a lot.

As for the old global control line

                       origin - dirty
           pos_ratio = --------------           (1)
                       origin - goal

where

        origin = 4 * thresh                     (2)

effectively decides the slope of the line. The use of @limit in code

        origin = max(4 * thresh, limit)         (3)

is merely to safeguard the rare case that (2) might result in negative
pos_ratio in (1).

I have another patch to add a "brake" area immediately below @limit
that will scale pos_ratio down to 0. However that's no longer
necessary with the 3rd order polynomial solution. 

Note that @limit will normally be equal to @thresh except in the rare
case that @thresh is suddenly knocked down and @limit is taking time
to follow it.
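
To make (1)-(3) concrete, here is a minimal userspace sketch of the old
linear control line. It is only an illustration: the fixed-point shift
POS_RATIO_SHIFT, the value DIRTY_SCOPE == 8 and the helper names are
assumptions made for this example, and the goal uses the
thresh - thresh / DIRTY_SCOPE form from the quoted mail below.

```c
#include <assert.h>
#include <stdint.h>

#define POS_RATIO_SHIFT	10	/* hypothetical: 1.0 == 1 << 10 */
#define DIRTY_SCOPE	8	/* assumed value, for illustration only */

/* goal = thresh - thresh / DIRTY_SCOPE */
static inline uint64_t global_goal(uint64_t thresh)
{
	return thresh - thresh / DIRTY_SCOPE;
}

/* pos_ratio = (origin - dirty) / (origin - goal), in fixed point */
static inline uint64_t linear_pos_ratio(uint64_t thresh, uint64_t limit,
					uint64_t dirty)
{
	uint64_t origin, goal;

	assert(thresh <= UINT64_MAX / 4);	/* 4 * thresh must not overflow */
	origin = 4 * thresh > limit ? 4 * thresh : limit;	/* (2) and (3) */
	goal = global_goal(thresh);
	assert(origin > goal);			/* needs thresh or limit > 0 */

	if (dirty >= origin)			/* never go negative */
		return 0;
	assert(origin - dirty <= (UINT64_MAX >> POS_RATIO_SHIFT));
	return ((origin - dirty) << POS_RATIO_SHIFT) / (origin - goal);
}
```

With thresh == limit == 8000 the goal is 7000 and pos_ratio is exactly
1.0 (1024) at dirty == 7000. If thresh is knocked down to 1000 while
limit is still 8000 and dirty is 7000, then 4 * thresh alone would make
(1) negative, whereas the max() in (3) keeps the result small but
positive (143, roughly 0.14).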

Thanks,
Fengguang

> Hi Fengguang,
> 
> Ok, so just trying to understand this pos_ratio little better.
> 
> You have following basic formula.
> 
>                      origin - dirty
>          pos_ratio = --------------
>                      origin - goal
> 
> Terminology is very confusing and following is my understanding. 
> 
> - setpoint == goal
> 
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE) 
> 
> - thresh
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
> 
> - Origin (seems to be equivalent of limit)
> 
>   This seems to be the reference point/limit we don't want to cross and
>   distance from this limit basically decides the pos_ratio. Closer we
>   are to limit, lower the pos_ratio and further we are higher the
>   pos_ratio.
> 
> So threshold is just a number which helps us determine goal and limit.
> 
> goal = thresh - thresh / DIRTY_SCOPE
> limit = 4*thresh
> 
> So goal is where we want to be and we start throttling the task more as
> we move away goal and approach limit. We keep the limit high enough
> so that (origin-dirty) does not become negative entity.
> 
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
> 
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
> 
> 	thresh == soft limit
> 	limit == 4*thresh (hard limit)
> 	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
> 						be in steady state).
>                      limit - dirty
>          pos_ratio = --------------
>                      limit - goal
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 11:14                   ` Jan Kara
@ 2011-08-16  8:35                     ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 07:14:23PM +0800, Jan Kara wrote:
> On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> > On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > > >                     origin - dirty
> > > > > >         pos_ratio = --------------
> > > > > >                     origin - goal 
> > > > > 
> > > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > > pos_ratio == 1.0:
> > > > > 
> > > > > OK, so basically you want a linear function for which:
> > > > > 
> > > > > f(goal) = 1 and has a root somewhere > goal.
> > > > > 
> > > > > (that one line is much more informative than all your graphs put
> > > > > together, one can start from there and derive your function)
> > > > > 
> > > > > That does indeed get you the above function, now what does it mean? 
> > > > 
> > > > So going by:
> > > > 
> > > >                                          write_bw
> > > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > > >                                          dirty_bw
> > > 
> > >   Actually, thinking about these formulas, why do we even bother with
> > > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > > Couldn't we just have a feedback loop (probably similar to the one
> > > computing pos_ratio) which will maintain single value - ratelimit? When we
> > > are getting close to dirty limit, we will scale ratelimit down, when we
> > > will be getting significantly below dirty limit, we will scale the
> > > ratelimit up.  Because looking at the formulas it seems to me that the net
> > > effect is the same - pos_ratio basically overrules everything... 
> > 
> > Good question. That is actually one of the early approaches I tried.
> > It somehow worked, however the resulted ratelimit is not only slow
> > responding, but also oscillating all the time.
>   Yes, I think I vaguely remember that.
> 
> > This is due to the imperfections
> > 
> > 1) pos_ratio at best only provides a "direction" for adjusting the
> >    ratelimit. There is only vague clues that if pos_ratio is small,
> >    the errors in ratelimit should be small.
> > 
> > 2) Due to time-lag, the assumptions in (1) about "direction" and
> >    "error size" can be wrong. The ratelimit may already be
> >    over-adjusted when the dirty pages take time to approach the
> >    setpoint. The larger memory, the more time lag, the easier to
> >    overshoot and oscillate.
> > 
> > 3) dirty pages are constantly fluctuating around the setpoint,
> >    so is pos_ratio.
> > 
> > With (1) and (2), it's a control system very susceptible to disturbs.
> > With (3) we get constant disturbs. Well I had very hard time and
> > played dirty tricks (which you may never want to know ;-) trying to
> > tradeoff between response time and stableness..
>   Yes, I can see especially 2) is a problem. But I don't understand why
> your current formula would be that much different. As Peter decoded from
> your code, your current formula is:
>                                         write_bw
>  ref_bw = dirty_ratelimit * pos_ratio * --------
>                                         dirty_bw
> 
> while previously it was essentially:
>  ref_bw = dirty_ratelimit * pos_ratio

Sorry what's the code you are referring to? Does the changelog in the
newly posted patchset make the ref_bw calculation and dirty_ratelimit
updating more clear?

> So what is so magical about computing write_bw and dirty_bw separately? Is
> it because previously you did not use derivation of distance from the goal
> for updating pos_ratio? Because in your current formula write_bw/dirty_bw
> is a derivation of position...

dirty_bw is the main feedback. If we are throttling too much, the
resulting dirty_bw will be lower than write_bw. Thus

                                      write_bw
   ref_bw = ratelimit_in_past_200ms * --------
                                      dirty_bw

will give us a higher ref_bw than ratelimit_in_past_200ms. For a pure
dd workload, the ref_bw computed by the above formula is exactly the
balanced rate (ignoring trivial errors).
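
A rough userspace sketch of that estimation, assuming all rates are in
pages per second. The function name, the "| 1" guard against a zero
dirty_bw and the overflow assert are choices made for this example, not
taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ref_bw = task_ratelimit * write_bw / dirty_bw
 *
 * task_ratelimit: the per-task ratelimit in effect over the past period
 * write_bw:       estimated writeout bandwidth of the device
 * dirty_bw:       measured rate at which pages were dirtied
 */
static inline uint64_t estimate_ref_bw(uint64_t task_ratelimit,
				       uint64_t write_bw, uint64_t dirty_bw)
{
	/* the product must fit in 64 bits */
	assert(task_ratelimit == 0 || write_bw <= UINT64_MAX / task_ratelimit);
	return task_ratelimit * write_bw / (dirty_bw | 1);
}
```

For example, take 5 dd tasks and write_bw == 1000. If each task was
throttled to 25, then dirty_bw == 125 and ref_bw comes out as 200, which
is write_bw / 5. If each task was allowed 405, then dirty_bw == 2025 and
ref_bw is again 200. Either way the estimate lands on the balanced rate
in one step.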

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 13:04             ` Peter Zijlstra
@ 2011-08-12 14:20               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Peter,

Sorry for the delay..

On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:

To start with,

                                                write_bw
        ref_bw = task_ratelimit_in_past_200ms * --------
                                                dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

> > Now all of the above would seem to suggest:
> > 
> >   dirty_ratelimit := ref_bw

Right, ideally ref_bw is the balanced dirty ratelimit. I actually
started with exactly the above equation when I got choked by pure
pos_bw based feedback control (as mentioned in the reply to Jan's
email) and introduced the ref_bw estimation as the way out.

But there are some imperfections in ref_bw, too, which make it
unsuitable for direct use:

1) large fluctuations

The dirty_bw used for computing ref_bw is averaged over only the past
200ms (very short compared to the 3s estimation period of write_bw),
which makes the distribution of ref_bw rather dispersed.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/ext4-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:48/balance_dirty_pages-pages.png

Take a look at the blue [*] points in the above graph. I find it pretty
hard to average out the singular points by increasing the estimation
period. Considering that averaging would introduce very undesirable
time lags, I gave up on it entirely. (btw, the write_bw averaging time
lag is much more acceptable because its impact is one-way and therefore
won't lead to oscillations.)

The one practical way is filtering -- the largest singular ref_bw
points can be filtered out effectively by remembering some prev_ref_bw
and prev_prev_ref_bw. However, that cannot get rid of all of them, and
the remaining majority of ref_bw points still dance randomly around the
ideal balanced rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
match becomes unbalanced, which leads to large systematic errors in
ref_bw. Truncates, due to their possibly bumpy nature, can hardly be
compensated for smoothly. So let's face it. When some over-estimated
ref_bw brings ->dirty_ratelimit high, pushing the dirty pages above the
setpoint, pos_bw will in turn become lower than ->dirty_ratelimit. So
if we consider both ref_bw and pos_bw and update ->dirty_ratelimit only
when they are on the same side of ->dirty_ratelimit, the systematic
errors in ref_bw won't be able to bring ->dirty_ratelimit too far away.

The ref_bw estimation is also not accurate when near the max pause and
free run areas.

3) since we ultimately want to

- keep the dirty pages around the setpoint for as long as possible
- keep the fluctuations of task ratelimit as small as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
point in bringing up dirty_ratelimit in a hurry and hurting both of
the above goals.

> > However for that you use:
> > 
> >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> >         dirty_ratelimit = max(ref_bw, pos_bw);
> > 
> >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> >         dirty_ratelimit = min(ref_bw, pos_bw);

The above are merely constraints on the dirty_ratelimit update.
They serve to

1) stop adjusting the rate when it's against the position control
   target (the adjusted rate will slow down the progress of dirty
   pages going back to setpoint).

2) limit the step size. pos_bw changes value step by step, leaving a
   consistent trace compared to the randomly jumping ref_bw. pos_bw
   also has smaller errors in stable state and normally has larger
   errors when there are big errors in rate. So it's a pretty good
   limiting factor for the step size of dirty_ratelimit.
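
The two constraints can be sketched in userspace C as below. This only
restates the quoted if-statements as a pure function; the function name
and the local min/max helpers are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

static inline uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static inline uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Move dirty_ratelimit only when ref_bw (rate feedback) and pos_bw
 * (position feedback) sit on the same side of it, and only as far as
 * the nearer of the two.
 */
static inline uint64_t update_dirty_ratelimit(uint64_t dirty_ratelimit,
					      uint64_t ref_bw, uint64_t pos_bw)
{
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return max_u64(ref_bw, pos_bw);		/* step down */

	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return min_u64(ref_bw, pos_bw);		/* step up */

	return dirty_ratelimit;		/* the two disagree: keep the rate */
}
```

With dirty_ratelimit == 100, (ref_bw, pos_bw) == (60, 80) steps down to
80 and (150, 120) steps up to 120, while (150, 80) leaves the rate at
100. That last case is the one from (3): the dirty pages are high but
ref_bw asks for more, so nothing is changed.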

> > You have:
> > 
> >   pos_bw = dirty_ratelimit * pos_ratio
> > 
> > Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> > why are you ignoring the shift in output vs input rate there? 

Again, you need to understand pos_bw the other way.  Only (pos_bw -
dirty_ratelimit) matters here, which is exactly the position error.

> Could you elaborate on this primary feedback loop? Its the one part I
> don't feel I actually understand well.

Hope the above elaboration helps :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-12 13:19               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:20:27AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint.

Yes.

> So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.

However, the above function is better interpreted as

                                            write_bw
    ref_bw = task_ratelimit_in_past_200ms * --------
                                            dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

It would be highly confusing to look for a direct "logical"
relationship between ref_bw and pos_ratio in the above equation.
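
Read that way, ref_bw is just the recently applied task rate limit
rescaled by how fast the disk drained pages versus how fast tasks dirtied
them. A minimal sketch, assuming all rates are in pages per second; the
function name calc_ref_bw() and the zero-divisor guard are assumptions
for illustration, not taken from the patch:

```c
#include <stdint.h>

/*
 * ref_bw = task_ratelimit_in_past_200ms * write_bw / dirty_bw
 *
 * If tasks throttled at task_ratelimit dirtied pages at dirty_bw while
 * the device wrote them back at write_bw, scaling the limit by
 * write_bw / dirty_bw is the rate that would have balanced the two.
 */
unsigned long calc_ref_bw(unsigned long task_ratelimit,
			  unsigned long write_bw,
			  unsigned long dirty_bw)
{
	/* Widen before multiplying so 32-bit unsigned long cannot overflow. */
	uint64_t ref = (uint64_t)task_ratelimit * write_bw;

	/* Assumption for this sketch: treat "nothing dirtied" as 1 page/s. */
	if (dirty_bw == 0)
		dirty_bw = 1;

	return (unsigned long)(ref / dirty_bw);
}
```

With task_ratelimit = 1000, write_bw = 50 and dirty_bw = 100, tasks
dirtied twice as fast as the disk could write, so ref_bw comes out as 500.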

> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1

Right.

> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.

Thanks for your reasoning, which led to the more elegant

                            setpoint - dirty 3
   pos_ratio(dirty) := 1 + (----------------)
                            limit - setpoint
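
To compare the two control lines side by side, here is a floating-point
sketch of both. The function names are hypothetical and doubles are used
purely for readability; the code discussed in this thread works in fixed
point instead.

```c
#include <assert.h>

/* Linear control line: 1.0 at the setpoint, 0.0 at the limit. */
double pos_ratio_linear(double dirty, double setpoint, double limit)
{
	assert(limit > setpoint);
	return (limit - dirty) / (limit - setpoint);
}

/*
 * Cubic control line:
 *
 *                     setpoint - dirty 3
 *   f(dirty) := 1 + (----------------)
 *                     limit - setpoint
 *
 * It hits the same two points as the linear one, but is flat around the
 * setpoint and steep towards the limit.
 */
double pos_ratio_cubic(double dirty, double setpoint, double limit)
{
	double x;

	assert(limit > setpoint);
	x = (setpoint - dirty) / (limit - setpoint);
	return 1.0 + x * x * x;
}
```

With setpoint = 2000 and limit = 3000, both functions return 1.0 at 2000
and 0.0 at 3000. Halfway between, at dirty = 2500, the linear line has
already dropped to 0.5 while the cubic is still at 0.875, which is the
"small oscillation near setpoint" property.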

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:59               ` Wu Fengguang
  (?)
@ 2011-08-12 13:08                 ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 20:59 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > > 
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                  d
> > > 
> > btw, if you want steeper slopes for rampup and brake you can add another
> > factor like:
> > 
> >                  s - x 3
> >   f(x) :=  1 + a(-----)
> >                    d
> >  
> > And solve the whole f(l)=0 thing again to determine d in l and a.
> > 
> > For 0 < a < 1 the slopes increase.
> 
> Yes, we can leave it as a future tuning option. For now I'm pretty
> satisfied with the current function's shape :)

Oh for sure, it just occurred to me when looking at your plots and I
thought I'd at least mention it. You know, something to poke at on a
rainy afternoon ;-)

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 13:04             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
>         dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
>         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there? 

Could you elaborate on this primary feedback loop? It's the one part I
don't feel I actually understand well.



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:54             ` Peter Zijlstra
@ 2011-08-12 12:59               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                  d
> > 
> btw, if you want steeper slopes for rampup and brake you can add another
> factor like:
> 
>                  s - x 3
>   f(x) :=  1 + a(-----)
>                    d
>  
> And solve the whole f(l)=0 thing again to determine d in l and a.
> 
> For 0 < a < 1 the slopes increase.

Yes, we can leave it as a future tuning option. For now I'm pretty
satisfied with the current function's shape :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 12:54             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
btw, if you want steeper slopes for rampup and brake you can add another
factor like:

                 s - x 3
  f(x) :=  1 + a(-----)
                   d
 
And solve the whole f(l)=0 thing again to determine d in l and a.

For 0 < a < 1 the slopes increase.
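
Working the f(l)=0 condition through for the scaled form may help when
picking this up later. This is my own algebra on the formula above, under
the assumption that f(s)=1 and f(l)=0 are the only constraints:

```latex
\begin{align*}
f(x) &= 1 + a\left(\frac{s - x}{d}\right)^{3} \\
f(l) = 0 \;&\Rightarrow\; a\,\frac{(s - l)^{3}}{d^{3}} = -1
       \;\Rightarrow\; d^{3} = a\,(l - s)^{3}
       \;\Rightarrow\; d = a^{1/3}\,(l - s) \\
f(x) &= 1 + a\,\frac{(s - x)^{3}}{a\,(l - s)^{3}}
      = 1 + \left(\frac{s - x}{l - s}\right)^{3}
\end{align*}
```

So if d is re-solved from f(l)=0, the factor a appears to cancel and the
curve is unchanged. The slope only changes when one constraint is relaxed,
for instance by keeping d fixed, in which case f'(x) = -3a(s-x)^2/d^3
scales with a and the root moves to x = s + d*a^(-1/3). Under that
reading the slopes get steeper for a > 1, so the direction stated above
would need double-checking against the intended parametrisation.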

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 11:07                     ` Wu Fengguang
  (?)
@ 2011-08-12 12:17                       ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 19:07 +0800, Wu Fengguang wrote:
> Because pos_ratio was "unsigned long long"..

Ah! totally missed that ;-)

Yes looks good.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:47                 ` Peter Zijlstra
@ 2011-08-12 11:11                   ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:47:54PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                l - s
> > 
> 
> > Looks very neat, much simpler than the three curves solution!
> 
> Glad you like it, there is of course the small matter of real-world
> behaviour to consider, let's hope that works as well :-)

It magically meets all the criteria in my mind, not to mention it can
eliminate 2 extra patches. As for the tests, so far, so good :)

Your arithmetic is awesome!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:45                   ` Peter Zijlstra
@ 2011-08-12 11:07                     ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:45:33PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> > Code is
> > 
> >         unsigned long freerun = (thresh + bg_thresh) / 2;
> > 
> >         setpoint = (limit + freerun) / 2;
> >         pos_ratio = abs(dirty - setpoint);
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, limit - setpoint + 1);
> 
> Why do you use do_div()? from the code those things are unsigned long,
> and you can divide that just fine.

Because pos_ratio was "unsigned long long"..

> Also, there's div64_s64 that can do signed divides for s64 types.
> That'll lose the extra conditionals you used for abs and putting the
> sign back.

Ah ok, good to know that :)

> >         x = pos_ratio;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> 
> So on 32bit with unsigned long that gets 32=2*(10+b) bits for x, that
> solves to 6, which isn't going to be enough I figure since
> (dirty-setpoint) !< 64.
> 
> So you really need to use u64/s64 types here, unsigned long just won't
> do, with u64 you have 64=2(10+b) 22 bits for x, which should fit.

Sure, here is the updated code:

        long long pos_ratio;            /* for scaling up/down the rate limit */
        long x;
       
        if (unlikely(dirty >= limit))
                return 0;

        /*
         * global setpoint
         *
         *                  setpoint - dirty 3
         * f(dirty) := 1 + (----------------)
         *                  limit - setpoint
         *
         * it's a 3rd order polynomial that is subject to
         *
         * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
         * (2) f(setpoint) = 1.0 => the balance point
         * (3) f(limit)    = 0   => the hard limit
         * (4) df/dx < 0         => negative feedback control
         * (5) the closer to setpoint, the smaller |df/dx| (and the reverse),
         *     => fast response on large errors; small oscillation near setpoint
         */
        setpoint = (limit + freerun) / 2;
        pos_ratio = (setpoint - dirty) << RATELIMIT_CALC_SHIFT;
        pos_ratio = div_s64(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
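
For anyone who wants to try the arithmetic outside the kernel, here is a
userland sketch of the same fixed-point calculation. Several things are
assumptions made for illustration: the function name pos_ratio_fixed(),
the use of int64_t for every operand, RATELIMIT_CALC_SHIFT being 10 (taken
from the "10+b" in the bit-budget argument above), and the use of multiply
and divide by 2^10 instead of shifts, so that negative values behave
portably in ISO C.

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* assumed, see lead-in */
#define RATELIMIT_ONE		((int64_t)1 << RATELIMIT_CALC_SHIFT)

/*
 * Returns f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * scaled by 2^RATELIMIT_CALC_SHIFT, so 1024 means 1.0.
 *
 * x carries 10 fractional bits plus b integer bits.  Each product
 * below needs about 2*(10+b) bits, which is why 64-bit types are
 * used: they leave b = 22, while 32 bits would leave only b = 6.
 */
int64_t pos_ratio_fixed(int64_t dirty, int64_t freerun, int64_t limit)
{
	int64_t setpoint, x, ratio;

	assert(freerun < limit);

	if (dirty >= limit)
		return 0;	/* hard limit: throttle completely */

	setpoint = (limit + freerun) / 2;

	/* x = (setpoint - dirty) / (limit - setpoint), in fixed point */
	x = (setpoint - dirty) * RATELIMIT_ONE / (limit - setpoint + 1);

	ratio = x * x / RATELIMIT_ONE;		/* x^2 */
	ratio = ratio * x / RATELIMIT_ONE;	/* x^3 */

	return ratio + RATELIMIT_ONE;		/* 1 + x^3 */
}
```

With freerun = 1000 and limit = 3000 the setpoint is 2000, and the
function returns exactly 1024 (1.0) at dirty = 2000, a value just below
2048 (2.0) at dirty = 1000, and a value close to 0 just under the limit.
The "+ 1" in the divisor keeps it non-zero and is why the end points are
slightly off their ideal values.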

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  9:47                 ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 

> Looks very neat, much simpler than the three curves solution!

Glad you like it, there is of course the small matter of real-world
behaviour to consider, let's hope that works as well :-)

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-12  9:47                 ` Peter Zijlstra
  0 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 

> Looks very neat, much simpler than the three curves solution!

Glad you like it, there is of course the small matter of real-world
behaviour to consider, lets hope that works as well :-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  5:45                 ` Wu Fengguang
  (?)
@ 2011-08-12  9:45                   ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> Code is
> 
>         unsigned long freerun = (thresh + bg_thresh) / 2;
> 
>         setpoint = (limit + freerun) / 2;
>         pos_ratio = abs(dirty - setpoint);
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, limit - setpoint + 1);

Why do you use do_div()? From the code, those things are unsigned long,
and you can divide those just fine.

Also, there's div64_s64(), which can do signed divides for s64 types.
That'll lose the extra conditionals you used for abs() and for putting
the sign back.

>         x = pos_ratio;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;

So on 32-bit with unsigned long, x gets b bits where 32 = 2*(10+b), which
solves to b = 6. That isn't going to be enough, I figure, since
(dirty - setpoint) will not stay below 64.

So you really need to use u64/s64 types here; unsigned long just won't
do. With u64 you have 64 = 2*(10+b), i.e. b = 22 bits for x, which
should fit.
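
To make the bit counting concrete, here is a small userspace sketch (not
kernel code). With a 10-bit scale, x*x needs 2*(10+b) bits, so the width of
the type fixes b. The helper names are made up for illustration, and the
shift value of 10 is taken from the "10+b" estimate above.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* scale bits, from the "10+b" estimate */

/* Integer bits b available for x when x*x must fit in `width` bits. */
static int x_integer_bits(int width)
{
	assert(width % 2 == 0 && width / 2 >= CALC_SHIFT);
	return width / 2 - CALC_SHIFT;
}

/* Square kept to 32 bits; the truncation models 32-bit wraparound. */
static uint32_t square_u32(uint32_t x)
{
	return (uint32_t)((uint64_t)x * x);
}

/* Same square with a 64-bit intermediate. */
static uint64_t square_u64(uint32_t x)
{
	return (uint64_t)x * x;
}
```

x_integer_bits(32) gives 6 and x_integer_bits(64) gives 22. Squaring
1 << 17 (one bit over the 32-bit budget) wraps to 0 in 32 bits, but is
1 << 34 with the 64-bit intermediate.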


>         if (dirty > setpoint)
>                 pos_ratio = -pos_ratio;
>         pos_ratio += 1 << BANDWIDTH_CALC_SHIFT; 



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
@ 2011-08-12  5:45                 ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12  5:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

I'll use the above one, which is simpler and more elegant:

        f(freerun)  = 2.0
        f(setpoint) = 1.0
        f(limit)    = 0

Code is

        unsigned long freerun = (thresh + bg_thresh) / 2;

        setpoint = (limit + freerun) / 2;
        pos_ratio = abs(dirty - setpoint);
        pos_ratio <<= BANDWIDTH_CALC_SHIFT;
        do_div(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        if (dirty > setpoint)
                pos_ratio = -pos_ratio;
        pos_ratio += 1 << BANDWIDTH_CALC_SHIFT;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  3:18               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12  3:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1306 bytes --]

Sorry, I forgot the 2 gnuplot figures; they are attached now.

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3
> 
> with limit=thresh+thresh/DIRTY_SCOPE
> 
>         gnuplot> set xrange [60000:90000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3
> 
> Figures attached.  The latter produces reasonably flat slope and I'll
> give it a spin in the dd tests :)
>  
> > You can clamp it at [0,2] or so.
> 
> Looking at the figures, we may even do without the clamp because it's
> already inside the range [0, 2].
> 
> > The implementation wouldn't be too horrid either, something like:
> > 
> > unsigned long bdi_pos_ratio(..)
> > {
> > 	if (dirty > limit)
> > 		return 0;
> > 
> > 	if (dirty < 2*setpoint - limit)
> > 		return 2 * SCALE;
> > 
> > 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> > 	xx = (x * x) / SCALE;
> > 	xxx = (xx * x) / SCALE;
> > 
> > 	return xxx;
> > }
> 
> Looks very neat, much simpler than the three curves solution!
> 
> Thanks,
> Fengguang

[-- Attachment #2: 3rd-order-limit=thresh+halfscope.png --]
[-- Type: image/png, Size: 30247 bytes --]

[-- Attachment #3: 3rd-order-limit=thresh.png --]
[-- Type: image/png, Size: 28785 bytes --]

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 22:56             ` Peter Zijlstra
@ 2011-08-12  2:43               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-12  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 06:56:06AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> > 
> > pos_ratio seems to be the feedback on the deviation of the dirty pages
> > around its setpoint. So we adjust the reference bw (or rather ratelimit)
> > to take account of the shift in output vs input capacity as well as the
> > shift in dirty pages around its setpoint.
> > 
> > From that we derive the condition that: 
> > 
> >   pos_ratio(setpoint) := 1
> > 
> > Now in order to create a linear function we need one more condition. We
> > get one from the fact that once we hit the limit we should hard throttle
> > our writers. We get that by setting the ratelimit to 0, because, after
> > all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> > 
> >   pos_ratio(limit) := 0
> > 
> > Using these two conditions we can solve the equations and get your:
> > 
> >                         limit - dirty
> >   pos_ratio(dirty) =  ----------------
> >                       limit - setpoint
> > 
> > Now, for some reason you chose not to use limit, but something like
> > min(limit, 4*thresh) something to do with the slope affecting the rate
> > of adjustment. This wants a comment someplace. 
> 
> Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than
> your negative slope (df/dx < 0), simply because it implies your
> condition and because it expresses our hard stop at limit.

Right. That's a good point.

> Also, while I know this is totally over the top, but..
> 
> I saw you added a ramp and brake area in future patches, so have you
> considered using a third order polynomial instead?

No I have not ;)

The 3 lines/curves should be a bit more flexible/configurable than the
single 3rd order polynomial.  However, the 3rd order polynomial is
certainly much simpler and more consistent, since it removes the explicit
rampup/brake areas and curves.

> The simple:
> 
>  f(x) = -x^3 
> 
> has the 'right' shape, all we need is move it so that:
> 
>  f(s) = 1
> 
> and stretch it to put the single root at our limit. You'd get something
> like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
> Which, as required, is 1 at our setpoint and the factor d stretches the
> middle bit. Which has a single (real) root at: 
> 
>   x = s + d, 
> 
> by setting that to our limit, we get:
> 
>   d = l - s
> 
> Making our final function look like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                l - s

Very intuitive reasoning, thanks!

I substituted real numbers into the function, assuming a mem=2GB system.

with limit=thresh:

        gnuplot> set xrange [60000:80000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

with limit=thresh+thresh/DIRTY_SCOPE

        gnuplot> set xrange [60000:90000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3

Figures attached.  The latter produces a reasonably flat slope and I'll
give it a spin in the dd tests :)
 
> You can clamp it at [0,2] or so.

Looking at the figures, we may even do without the clamp because it's
already inside the range [0, 2].
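
As a quick check of that, here is a userspace sketch with the numbers from
the first plot (setpoint 70000, limit 80000). cubic_pos_ratio() is just an
illustrative name. Inside [2*setpoint - limit, limit] the value stays
within [0, 2], but outside that window it does not, so dropping the clamp
only works as long as dirty never leaves that range.

```c
#include <assert.h>

/* f(x) = 1 + ((s - x) / (l - s))^3, evaluated in floating point. */
static double cubic_pos_ratio(double x, double s, double l)
{
	double t;

	assert(l > s);
	t = (s - x) / (l - s);
	return 1.0 + t * t * t;
}
```

cubic_pos_ratio(60000, 70000, 80000) is 2 and
cubic_pos_ratio(80000, 70000, 80000) is 0, while
cubic_pos_ratio(90000, 70000, 80000) is already -7.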

> The implementation wouldn't be too horrid either, something like:
> 
> unsigned long bdi_pos_ratio(..)
> {
> 	if (dirty > limit)
> 		return 0;
> 
> 	if (dirty < 2*setpoint - limit)
> 		return 2 * SCALE;
> 
> 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> 	xx = (x * x) / SCALE;
> 	xxx = (xx * x) / SCALE;
> 
> 	return xxx;
> }

Looks very neat, much simpler than the three curves solution!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-11 22:56             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-11 22:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace. 

Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than
your negative slope (df/dx < 0), simply because it implies your
condition and because it expresses our hard stop at limit.

Also, while I know this is totally over the top, but..

I saw you added a ramp and brake area in future patches, so have you
considered using a third order polynomial instead?

The simple:

 f(x) = -x^3 

has the 'right' shape; all we need is to move it so that:

 f(s) = 1

and stretch it to put the single root at our limit. You'd get something
like:

               s - x 3
 f(x) :=  1 + (-----)
                 d

Which, as required, is 1 at our setpoint and the factor d stretches the
middle bit. Which has a single (real) root at: 

  x = s + d, 

by setting that to our limit, we get:

  d = l - s

Making our final function look like:

               s - x 3
 f(x) :=  1 + (-----)
               l - s

You can clamp it at [0,2] or so. The implementation wouldn't be too
horrid either, something like:

unsigned long bdi_pos_ratio(..)
{
	if (dirty > limit)
		return 0;

	if (dirty < 2*setpoint - limit)
		return 2 * SCALE;

	x = SCALE * (setpoint - dirty) / (limit - setpoint);
	xx = (x * x) / SCALE;
	xxx = (xx * x) / SCALE;

	return xxx;
}
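
Spelled out as something that compiles in userspace, that could look like
the sketch below. It is only an illustration: SCALE = 1024 and the signed
64-bit types are assumptions, the argument list is invented (the sketch
above elides it), and it adds SCALE to the cube so that the result follows
f(x) = 1 + (...)^3.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE 1024	/* assumed fixed-point unit */

/* Hypothetical userspace variant; returns f(dirty) scaled by SCALE. */
static int64_t pos_ratio_sketch(int64_t dirty, int64_t setpoint, int64_t limit)
{
	int64_t x, xx, xxx;

	assert(limit > setpoint);

	if (dirty > limit)			/* hard stop past the limit */
		return 0;

	if (dirty < 2 * setpoint - limit)	/* clamp the upper end at 2 */
		return 2 * SCALE;

	/* signed on purpose: x is negative once dirty exceeds setpoint */
	x = SCALE * (setpoint - dirty) / (limit - setpoint);
	xx = (x * x) / SCALE;
	xxx = (xx * x) / SCALE;

	return SCALE + xxx;			/* 1 + x^3 */
}
```

With setpoint = 70000 and limit = 80000 this returns SCALE at the
setpoint, 0 at and beyond the limit, and 2 * SCALE at or below
2*setpoint - limit = 60000.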


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11  2:29                 ` Wu Fengguang
@ 2011-08-11 11:14                   ` Jan Kara
  -1 siblings, 0 replies; 203+ messages in thread
From: Jan Kara @ 2011-08-11 11:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > >                     origin - dirty
> > > > >         pos_ratio = --------------
> > > > >                     origin - goal 
> > > > 
> > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > pos_ratio == 1.0:
> > > > 
> > > > OK, so basically you want a linear function for which:
> > > > 
> > > > f(goal) = 1 and has a root somewhere > goal.
> > > > 
> > > > (that one line is much more informative than all your graphs put
> > > > together, one can start from there and derive your function)
> > > > 
> > > > That does indeed get you the above function, now what does it mean? 
> > > 
> > > So going by:
> > > 
> > >                                          write_bw
> > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > >                                          dirty_bw
> > 
> >   Actually, thinking about these formulas, why do we even bother with
> > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > Couldn't we just have a feedback loop (probably similar to the one
> > computing pos_ratio) which will maintain single value - ratelimit? When we
> > are getting close to dirty limit, we will scale ratelimit down, when we
> > will be getting significantly below dirty limit, we will scale the
> > ratelimit up.  Because looking at the formulas it seems to me that the net
> > effect is the same - pos_ratio basically overrules everything... 
> 
> Good question. That is actually one of the early approaches I tried.
> It somehow worked, however the resulting ratelimit was not only slow
> to respond, but also oscillated all the time.
  Yes, I think I vaguely remember that.

> This is due to the following imperfections:
> 
> 1) pos_ratio at best only provides a "direction" for adjusting the
>    ratelimit. There are only vague clues that if pos_ratio is small,
>    the errors in ratelimit should be small.
> 
> 2) Due to time lag, the assumptions in (1) about "direction" and
>    "error size" can be wrong. The ratelimit may already be
>    over-adjusted while the dirty pages are still approaching the
>    setpoint. The larger the memory, the longer the time lag and the
>    easier it is to overshoot and oscillate.
> 
> 3) Dirty pages are constantly fluctuating around the setpoint,
>    and so is pos_ratio.
> 
> With (1) and (2), it's a control system very susceptible to disturbances.
> With (3) we get constant disturbances. Well, I had a very hard time and
> played dirty tricks (which you may never want to know ;-) trying to
> trade off response time against stability.
  Yes, I can see that 2) especially is a problem. But I don't understand why
your current formula would be that much different. As Peter decoded from
your code, your current formula is:
                                        write_bw
 ref_bw = dirty_ratelimit * pos_ratio * --------
                                        dirty_bw

while previously it was essentially:
 ref_bw = dirty_ratelimit * pos_ratio

So what is so magical about computing write_bw and dirty_bw separately? Is
it because previously you did not use the derivative of the distance from
the goal for updating pos_ratio? Because in your current formula
write_bw/dirty_bw is a derivative of position...
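
As a purely arithmetic illustration of the two formulas being compared (it
does not answer the question), the sketch below evaluates both in fixed
point. The function names, the SCALE value and the handling of dirty_bw == 0
are assumptions for this example only and are not taken from the patches.

```c
#include <stdint.h>

#define SCALE 1024	/* assumed fixed-point representation of pos_ratio == 1.0 */

/* Position-only form: dirty_ratelimit * pos_ratio */
static uint64_t pos_bw(uint64_t ratelimit, uint64_t pos_ratio)
{
	return ratelimit * pos_ratio / SCALE;
}

/* Balanced form: dirty_ratelimit * pos_ratio * write_bw / dirty_bw */
static uint64_t ref_bw(uint64_t ratelimit, uint64_t pos_ratio,
		       uint64_t write_bw, uint64_t dirty_bw)
{
	if (dirty_bw == 0)	/* assumption: no dirtying observed, skip the rate term */
		return pos_bw(ratelimit, pos_ratio);

	return pos_bw(ratelimit, pos_ratio) * write_bw / dirty_bw;
}
```

With the dirty pages exactly at the setpoint (pos_ratio == SCALE), a
ratelimit of 100, and tasks dirtying at 100 while the disk writes at 50, the
position-only form stays at 100 whereas the balanced form gives 50. The rate
term therefore carries information that pos_ratio alone does not have at that
instant.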

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 22:34               ` Jan Kara
@ 2011-08-11  2:29                 ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-11  2:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > >                     origin - dirty
> > > >         pos_ratio = --------------
> > > >                     origin - goal 
> > > 
> > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > pos_ratio == 1.0:
> > > 
> > > OK, so basically you want a linear function for which:
> > > 
> > > f(goal) = 1 and has a root somewhere > goal.
> > > 
> > > (that one line is much more informative than all your graphs put
> > > together, one can start from there and derive your function)
> > > 
> > > That does indeed get you the above function, now what does it mean? 
> > 
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> 
>   Actually, thinking about these formulas, why do we even bother with
> computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> Couldn't we just have a feedback loop (probably similar to the one
> computing pos_ratio) which maintains a single value, the ratelimit? When
> we are getting close to the dirty limit, we scale the ratelimit down; when
> we get significantly below the dirty limit, we scale the ratelimit up.
> Because looking at the formulas it seems to me that the net effect is the
> same - pos_ratio basically overrules everything...

Good question. That is actually one of the early approaches I tried.
It somehow worked, however the resulting ratelimit was not only slow
to respond, but also oscillated all the time.

This is due to the following imperfections:

1) pos_ratio at best only provides a "direction" for adjusting the
   ratelimit. There are only vague clues that if pos_ratio is small,
   the errors in ratelimit should be small.

2) Due to time lag, the assumptions in (1) about "direction" and
   "error size" can be wrong. The ratelimit may already be
   over-adjusted while the dirty pages are still approaching the
   setpoint. The larger the memory, the longer the time lag and the
   easier it is to overshoot and oscillate.

3) Dirty pages are constantly fluctuating around the setpoint,
   and so is pos_ratio.

With (1) and (2), it's a control system very susceptible to disturbances.
With (3) we get constant disturbances. Well, I had a very hard time and
played dirty tricks (which you may never want to know ;-) trying to
trade off response time against stability.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-10 22:34               ` Jan Kara
  -1 siblings, 0 replies; 203+ messages in thread
From: Jan Kara @ 2011-08-10 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw

  Actually, thinking about these formulas, why do we even bother with
computing all these factors like write_bw, dirty_bw, pos_ratio, ...
Couldn't we just have a feedback loop (probably similar to the one
computing pos_ratio) which maintains a single value, the ratelimit? When
we are getting close to the dirty limit, we scale the ratelimit down; when
we get significantly below the dirty limit, we scale the ratelimit up.
Because looking at the formulas it seems to me that the net effect is the
same - pos_ratio basically overrules everything...

> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.
> 
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> 	dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> 	dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
@ 2011-08-10 21:40             ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-10 21:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 07:05:35AM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> > On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> > >         goal = thresh - thresh / DIRTY_SCOPE;
> > >         origin = 4 * thresh;
> > >  
> > > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > > -               origin = limit;                 /* auxiliary control line */
> > > -               goal = (goal + origin) / 2;
> > > -               pos_ratio >>= 1;
> > > -       }
> > >         pos_ratio = origin - dirty;
> > >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> > >         do_div(pos_ratio, origin - goal + 1); 
> 
> FYI I've updated the fix to the below one, so that @limit will be used
> as the origin in the rare case of (4*thresh < dirty).
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
> @@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
>  	 * global setpoint
>  	 */
>  	goal = thresh - thresh / DIRTY_SCOPE;
> -	origin = 4 * thresh;
> +	origin = max(4 * thresh, limit);

Hi Fengguang,

Ok, so just trying to understand this pos_ratio a little better.

You have the following basic formula.

                     origin - dirty
         pos_ratio = --------------
                     origin - goal

The terminology is very confusing; the following is my understanding.

- setpoint == goal

  setpoint is the point where we would like our number of dirty pages to
  be and at this point pos_ratio = 1. For global dirty this number seems
  to be (thresh - thresh / DIRTY_SCOPE) 

- thresh
  dirty page threshold calculated from dirty_ratio (a certain percentage of
  total memory).

- Origin (seems to be equivalent of limit)

  This seems to be the reference point/limit we don't want to cross, and
  the distance from this limit basically decides the pos_ratio. The closer
  we are to the limit, the lower the pos_ratio; the further away we are,
  the higher the pos_ratio.

So threshold is just a number which helps us determine goal and limit.

goal = thresh - thresh / DIRTY_SCOPE
limit = 4*thresh

So goal is where we want to be, and we throttle the task more as we
move away from goal and approach limit. We keep the limit high enough
so that (origin - dirty) does not become negative.

So we do expect to cross "thresh", otherwise thresh itself could have
served as limit?

If my understanding is right, then can we get rid of the terms "setpoint"
and "origin"? Would it be easier to understand things if we just talked
in terms of "goal" and "limit" and how these are derived from "thresh"?

	thresh == soft limit
	limit == 4*thresh (hard limit)
	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
						be in steady state).
                     limit - dirty
         pos_ratio = --------------
                     limit - goal
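
To make the goal/limit reading above concrete, here is a small standalone
sketch of the linear ratio in fixed point. It is a sketch under stated
assumptions, not the kernel code: linear_pos_ratio() is a hypothetical name,
the values chosen for DIRTY_SCOPE and BANDWIDTH_CALC_SHIFT are assumed
(neither is given in this message), and the "+ 1" that the quoted patch adds
to the divisor is left out.

```c
#include <stdint.h>

#define DIRTY_SCOPE		8	/* assumed value */
#define BANDWIDTH_CALC_SHIFT	10	/* assumed: 1.0 is represented as 1 << 10 */

/*
 * Hypothetical helper: pos_ratio = (limit - dirty) / (limit - goal), with
 * goal and limit derived from thresh as described in the text.
 */
static uint64_t linear_pos_ratio(uint64_t thresh, uint64_t dirty)
{
	uint64_t goal = thresh - thresh / DIRTY_SCOPE;
	uint64_t limit = 4 * thresh;

	if (dirty >= limit)
		return 0;	/* never let (limit - dirty) go negative */

	return ((limit - dirty) << BANDWIDTH_CALC_SHIFT) / (limit - goal);
}
```

With thresh = 800 this gives goal = 700 and limit = 3200. The ratio is 1.0
(1024) at the goal and still about 0.96 (983) at thresh itself, so under
these assumptions crossing "thresh" costs a task only a few percent of its
rate, which matches the observation that thresh is expected to be crossed.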

Thanks
Vivek

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  9:31             ` Peter Zijlstra
@ 2011-08-10 12:28               ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-10 12:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 05:31:44PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> > origin is where the control line crosses the X axis (in both the
> > global/bdi setpoint cases). 
> 
> Ah, that's normally called zero, root or x-intercept:
> 
> http://en.wikipedia.org/wiki/X-intercept

Yes indeed! I'll change the name to x_intercept.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-09 17:20             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> >                     origin - dirty
> >         pos_ratio = --------------
> >                     origin - goal 
> 
> > which comes from the below [*] control line, so that when (dirty == goal),
> > pos_ratio == 1.0:
> 
> OK, so basically you want a linear function for which:
> 
> f(goal) = 1 and has a root somewhere > goal.
> 
> (that one line is much more informative than all your graphs put
> together, one can start from there and derive your function)
> 
> That does indeed get you the above function, now what does it mean? 

So going by:

                                         write_bw
  ref_bw = dirty_ratelimit * pos_ratio * --------
                                         dirty_bw

pos_ratio seems to be the feedback on the deviation of the dirty pages
around its setpoint. So we adjust the reference bw (or rather ratelimit)
to take account of the shift in output vs input capacity as well as the
shift in dirty pages around its setpoint.

From that we derive the condition that:

  pos_ratio(setpoint) := 1

Now in order to create a linear function we need one more condition. We
get one from the fact that once we hit the limit we should hard throttle
our writers. We get that by setting the ratelimit to 0, because, after
all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:

  pos_ratio(limit) := 0

Using these two conditions we can solve the equations and get your:

                        limit - dirty
  pos_ratio(dirty) =  ----------------
                      limit - setpoint

Now, for some reason you chose not to use limit, but something like
min(limit, 4*thresh), which has something to do with the slope affecting
the rate of adjustment. This wants a comment someplace.
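
A minimal sketch of that linear control line in C, for plugging numbers
in. The helper name linear_pos_ratio() and the CALC_SHIFT constant are
made up for illustration (the patch uses BANDWIDTH_CALC_SHIFT and folds
the calculation into bdi_position_ratio()). Passing limit as the
x-intercept gives the function derived above; passing something like
min(limit, 4*thresh) would give the variant the patch appears to use.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed fixed-point shift: 1.0 == 1 << CALC_SHIFT */

/*
 * Linear control line through (setpoint, 1.0) and (x_intercept, 0).
 * Returns pos_ratio in fixed point.  Hypothetical helper, not the
 * patch's bdi_position_ratio().
 */
static uint64_t linear_pos_ratio(uint64_t dirty, uint64_t setpoint,
				 uint64_t x_intercept)
{
	/* the line needs its root strictly above the setpoint */
	assert(setpoint < x_intercept);

	if (dirty >= x_intercept)
		return 0;	/* at or past the root: hard throttle */

	return ((x_intercept - dirty) << CALC_SHIFT) /
		(x_intercept - setpoint);
}
```

With setpoint = 1000 and x_intercept = 2000, dirty = 1000 yields 1024
(1.0), dirty = 1500 yields 512 (0.5), dirty = 500 yields 1536 (1.5), and
anything at or above 2000 yields 0.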


Now all of the above would seem to suggest:

  dirty_ratelimit := ref_bw

However for that you use:

  if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
	dirty_ratelimit = max(ref_bw, pos_bw);

  if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
	dirty_ratelimit = min(ref_bw, pos_bw);

You have:

  pos_bw = dirty_ratelimit * pos_ratio

Which is ref_bw without the write_bw/dirty_bw factor. This confuses me:
why are you ignoring the shift in output vs input rate there?
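
The update rule quoted above can be written out as a standalone
function. This is only a sketch: the name update_dirty_ratelimit(), the
fixed-point CALC_SHIFT, and the integer rounding are assumptions for
illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed fixed-point shift for pos_ratio */

static inline uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static inline uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * One update step.  pos_ratio is fixed point (1.0 == 1 << CALC_SHIFT);
 * write_bw and dirty_bw share a unit, e.g. pages per second.
 */
static uint64_t update_dirty_ratelimit(uint64_t dirty_ratelimit,
				       uint64_t pos_ratio,
				       uint64_t write_bw,
				       uint64_t dirty_bw)
{
	uint64_t pos_bw, ref_bw;

	assert(dirty_bw > 0);

	/* position feedback only */
	pos_bw = (dirty_ratelimit * pos_ratio) >> CALC_SHIFT;
	/* position feedback plus the output/input rate correction */
	ref_bw = pos_bw * write_bw / dirty_bw;

	/* move only when both estimates agree on the direction ... */
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return max_u64(ref_bw, pos_bw);	/* ... and take the smaller step */
	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return min_u64(ref_bw, pos_bw);

	return dirty_ratelimit;	/* estimates disagree: leave it alone */
}
```

With dirty_ratelimit = 1000 and pos_ratio = 512 (0.5), a write_bw/dirty_bw
of 1/2 gives pos_bw = 500 and ref_bw = 250, so the limit only drops to
500. With write_bw/dirty_bw = 4 the two estimates disagree (500 vs 2000)
and the limit stays at 1000.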




^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-09 10:32             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-09 10:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 07:05 +0800, Wu Fengguang wrote:
> This is the more meaningful view :)
> 
>                     origin - dirty
>         pos_ratio = --------------
>                     origin - goal 

> which comes from the below [*] control line, so that when (dirty == goal),
> pos_ratio == 1.0:

OK, so basically you want a linear function for which:

f(goal) = 1 and has a root somewhere > goal.

(that one line is much more informative than all your graphs put
together, one can start from there and derive your function)

That does indeed get you the above function, now what does it mean?

> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.

(you seem inconsistent with your terminology, I think goal and setpoint
are interchanged? I looked up set point and it's a term from control
system theory, so I'll chalk that up to my own ignorance..)

Ok, so higher dirty -> lower position ratio -> lower dirty rate (and
the inverse), now what does that do...

/me goes read other patches in search of more clues.. I'm starting to
dislike graphs.. why not simply state where those things come from,
that's much easier.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 22:47           ` Wu Fengguang
  (?)
@ 2011-08-09  9:31             ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-09  9:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> origin is where the control line crosses the X axis (in both the
> global/bdi setpoint cases). 

Ah, that's normally called zero, root or x-intercept:

http://en.wikipedia.org/wiki/X-intercept

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09  2:08     ` Vivek Goyal
  -1 siblings, 0 replies; 203+ messages in thread
From: Vivek Goyal @ 2011-08-09  2:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> For simplicity, only the global/bdi setpoint control lines are
> implemented here, so the [*] curve is more straight than the ideal one
> showed in the above figure.
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 

IMHO, "position_ratio" is not necessarily very intuitive. Can there be
a better name? Based on your slides, it is a scaling factor applied to
the task rate limit depending on how well we are doing in terms of meeting
our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
that make sense and be a little more intuitive?

Thanks
Vivek
 

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define BANDWIDTH_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.
> + *
> + *                              setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------|-----------|
> + * ^                               ^                               ^           ^
> + * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
> + *
> + *                          bdi setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------------------|
> + * ^                               ^                                           ^
> + * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
> + *
> + * (o) pseudo code
> + *
> + *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
> + *
> + *     if (dirty < thresh) scale up   pos_ratio
> + *     if (dirty > thresh) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
> + *
> + * (o) global/bdi control lines
> + *
> + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
> + * several control lines in turn.
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary
> + * line will be setup to connect them. The below figure illustrates the main
> + * bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling bdi_dirty down to normal if it starts high
> + * in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
> + * - the bdi dirty thresh goes down quickly due to change of JBOD workload
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, bw scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, bw scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 bdi setpoint                 bdi origin            limit
> + *
> + * The bdi control line: if (origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;
> +	}
> +	pos_ratio = origin - dirty;
> +	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	/*
> +	 * bdi setpoint
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> +	/*
> +	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
> +	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
> +	 * Otherwise the bdi write bandwidth is good for limiting the floating
> +	 * area, which makes the bdi control line a good backup when the global
> +	 * control line is too flat/weak in large memory systems.
> +	 */
> +	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> +		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
> +	do_div(span, thresh + 1);
> +	origin = goal + 2 * span;
> +
> +	if (unlikely(bdi_dirty > goal + span)) {
> +		if (bdi_dirty > limit)
> +			return 0;
> +		if (origin < limit) {
> +			origin = limit;		/* auxiliary control line */
> +			goal += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= origin - bdi_dirty;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> 
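
The span interpolation in the quoted hunk can be checked in isolation.
The sketch below lifts only that arithmetic into a standalone function.
The name bdi_span() and the plain 64-bit division (instead of do_div())
are assumptions for illustration, and avg_write_bw stands in for
bdi->avg_write_bandwidth.

```c
#include <assert.h>
#include <stdint.h>

/*
 * span = (bdi_thresh * (thresh - bdi_thresh) + 4 * bw * bdi_thresh)
 *        / (thresh + 1)
 *
 * Weighted by bdi_thresh / thresh, this slides from roughly bdi_thresh
 * (bdi_thresh much smaller than thresh, the JBOD case) to roughly
 * 4 * bw (bdi_thresh == thresh, the single disk case).
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t avg_write_bw)
{
	uint64_t span;

	/* the quoted code clamps bdi_thresh to thresh before this point */
	assert(bdi_thresh <= thresh);

	span = bdi_thresh * (thresh - bdi_thresh) +
		4 * avg_write_bw * bdi_thresh;

	return span / (thresh + 1);	/* +1 avoids dividing by zero */
}
```

With avg_write_bw = 50, thresh = bdi_thresh = 1000 gives a span of 199
(just under 4 * bw = 200), while thresh = 1000000 and bdi_thresh = 1000
gives 999 (just under bdi_thresh).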

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:41         ` Peter Zijlstra
@ 2011-08-08 23:05           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-08 23:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> >         goal = thresh - thresh / DIRTY_SCOPE;
> >         origin = 4 * thresh;
> >  
> > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > -               origin = limit;                 /* auxiliary control line */
> > -               goal = (goal + origin) / 2;
> > -               pos_ratio >>= 1;
> > -       }
> >         pos_ratio = origin - dirty;
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, origin - goal + 1); 

FYI, I've updated the fix to the one below, so that @limit is used as
the origin in the rare case of (4*thresh < dirty).

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
@@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
 	 * global setpoint
 	 */
 	goal = thresh - thresh / DIRTY_SCOPE;
-	origin = 4 * thresh;
+	origin = max(4 * thresh, limit);
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

> So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
> comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 

This is the more meaningful view :)

                    origin - dirty
        pos_ratio = --------------
                    origin - goal

which comes from the [*] control line below, so that when (dirty == goal),
pos_ratio == 1.0:

 ^ pos_ratio
 |
 |
 |   *
 |      *
 |         *
 |            *
 |               *
 |                  *
 |                     *
 |                        *
 |                           *
 |                              *
 |                                 *
 .. pos_ratio = 1.0 ..................*
 |                                    .  *
 |                                    .     *
 |                                    .        *
 |                                    .           *
 |                                    .              *
 |                                    .                 *
 |                                    .                    *
 |                                    .                       *
 |                                    .                          *
 |                                    .                             *
 |                                    .                                *
 |                                    .                                   *
 |                                    .                                      *
 |                                    .                                         *
 |                                    .                                            *
 |                                    .                                               *
 +------------------------------------.--------------------------------------------------*---------------------->
 0                                   goal                                              origin         dirty pages
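
The control line above can be sketched in plain C as follows. This is only
an illustration, not the kernel code: it covers the global setpoint alone,
uses origin = max(4 * thresh, limit) as in the updated fix, replaces
do_div() with ordinary 64-bit division, and assumes DIRTY_SCOPE is 8 (which
the (25/8)t figure quoted above implies, but which this hunk does not
define). global_pos_ratio() is a hypothetical name.

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* from the patch: 1.0 == 1 << 10 */
#define DIRTY_SCOPE		8	/* assumption, not shown in this hunk */

/*
 * Global part of the position ratio only, in fixed point.
 * Returns 0 once dirty reaches limit, mirroring the early exit
 * in bdi_position_ratio().
 */
uint64_t global_pos_ratio(unsigned long thresh, unsigned long limit,
			  unsigned long dirty)
{
	unsigned long goal, origin;
	uint64_t pos_ratio;

	if (dirty >= limit)
		return 0;

	goal = thresh - thresh / DIRTY_SCOPE;
	/* max(4 * thresh, limit): the line crosses Y=0 at or beyond limit */
	origin = 4 * thresh > limit ? 4 * thresh : limit;

	pos_ratio = origin - dirty;
	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
	return pos_ratio / (origin - goal + 1);	/* do_div() in the kernel */
}
```

With thresh = 8000 and limit = 10000 pages, goal is 7000 and origin is
32000, so dirty == goal yields 1023, i.e. just under 1.0 (1 << 10) because
of the + 1 in the divisor.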

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:31         ` Peter Zijlstra
@ 2011-08-08 22:47           ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-08 22:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:31:49PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > It's actually dead code because (origin < limit) should never happen.
> > I feel so good being able to drop 5 more lines of code :) 
> 
> OK, but that leaves me trying to figure out what origin is, and why it's
> 4 * thresh.

origin is where the control line crosses the X axis (in both the
global/bdi setpoint cases).

"4 * thresh" is merely something larger than max(dirty, thresh)
that yields a reasonably gentle slope. The steeper the slope, the stronger
the "gravity" that pulls the dirty pages back to the setpoint.

> I'm having a horrible time understanding this stuff.

Sorry for that. Do you have more questions?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
>         goal = thresh - thresh / DIRTY_SCOPE;
>         origin = 4 * thresh;
>  
> -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> -               origin = limit;                 /* auxiliary control line */
> -               goal = (goal + origin) / 2;
> -               pos_ratio >>= 1;
> -       }
>         pos_ratio = origin - dirty;
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, origin - goal + 1); 

So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 
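
The algebra can be checked numerically with a short sketch. It assumes
goal = thresh - thresh/8 (that is, DIRTY_SCOPE == 8, which is what the
(25/8)t denominator implies), ignores the + 1 in the divisor, and uses
floating point rather than the kernel's fixed point. The function names are
made up for illustration.

```c
/* Direct form: (origin - dirty) / (origin - goal) with origin = 4t, goal = 7t/8. */
double pos_ratio_direct(double t, double d)
{
	return (4.0 * t - d) / (25.0 / 8.0 * t);
}

/* Expanded form: intercept 32/25 at d = 0, slope -8/(25t) per dirty page. */
double pos_ratio_expanded(double t, double d)
{
	return 32.0 / 25.0 - 8.0 * d / (25.0 * t);
}
```

Both forms give 1.28 at d = 0, 1.0 at d = 7t/8 (the goal) and 0 at d = 4t
(the origin). The expanded form also shows the slope directly: each extra
dirty page lowers pos_ratio by 8/(25t), so a larger thresh means a gentler
line.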



^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> It's actually dead code because (origin < limit) should never happen.
> I feel so good being able to drop 5 more lines of code :) 

OK, but that leaves me trying to figure out what origin is, and why it's
4 * thresh.

I'm having a horrible time understanding this stuff.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 13:46     ` Peter Zijlstra
@ 2011-08-08 14:11       ` Wu Fengguang
  -1 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:46:33PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +                                       unsigned long thresh,
> > +                                       unsigned long dirty,
> > +                                       unsigned long bdi_thresh,
> > +                                       unsigned long bdi_dirty)
> > +{
> > +       unsigned long limit = hard_dirty_limit(thresh);
> > +       unsigned long origin;
> > +       unsigned long goal;
> > +       unsigned long long span;
> > +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> > +
> > +       if (unlikely(dirty >= limit))
> > +               return 0;
> > +
> > +       /*
> > +        * global setpoint
> > +        */
> > +       goal = thresh - thresh / DIRTY_SCOPE;
> > +       origin = 4 * thresh;
> > +
> > +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > +               origin = limit;                 /* auxiliary control line */
> > +               goal = (goal + origin) / 2;
> > +               pos_ratio >>= 1; 
> 
> use before init?

Yeah, it's embarrassing: this bug goes all the way back to the initial version...

It's actually dead code because (origin < limit) should never happen.
I feel so good being able to drop 5 more lines of code :)

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:04:48.000000000 +0800
@@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
 	goal = thresh - thresh / DIRTY_SCOPE;
 	origin = 4 * thresh;
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-08 13:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 203+ messages in thread
From: Peter Zijlstra @ 2011-08-08 13:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long origin;
> +       unsigned long goal;
> +       unsigned long long span;
> +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        */
> +       goal = thresh - thresh / DIRTY_SCOPE;
> +       origin = 4 * thresh;
> +
> +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +               origin = limit;                 /* auxiliary control line */
> +               goal = (goal + origin) / 2;
> +               pos_ratio >>= 1; 

use before init?

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 203+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 7230 bytes --]

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

For simplicity, only the global/bdi setpoint control lines are
implemented here, so the [*] curve is straighter than the ideal one
shown in the figure above.

bdi_position_ratio() provides a scale factor for bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.
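
How the scale factor is applied is not part of this patch, so the following
is only a sketch under assumptions: it treats pos_ratio as a fixed-point
value where 1 << BANDWIDTH_CALC_SHIFT means 1.0, and multiplies it into the
base rate limit. scale_ratelimit() is a hypothetical helper, not a function
from the series.

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* from the patch: 1.0 == 1 << 10 */

/*
 * Scale a base rate limit (pages per second) by a fixed-point position
 * ratio. The 64-bit intermediate avoids overflow on 32-bit longs.
 */
unsigned long scale_ratelimit(unsigned long dirty_ratelimit, uint64_t pos_ratio)
{
	return (unsigned long)(((uint64_t)dirty_ratelimit * pos_ratio)
			       >> BANDWIDTH_CALC_SHIFT);
}
```

A pos_ratio of 1024 leaves the limit unchanged, 512 halves it, and values
above 1024 (dirty pages below the setpoint) let tasks dirty faster than
bdi->dirty_ratelimit.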

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define BANDWIDTH_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ *  When the number of dirty pages goes above/below the setpoint, the dirty
+ *  position ratio (and hence dirty rate limit) will be decreased/increased to
+ *  bring the dirty pages back to the setpoint.
+ *
+ *                              setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------|-----------|
+ * ^                               ^                               ^           ^
+ * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
+ *
+ *                          bdi setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------------------|
+ * ^                               ^                                           ^
+ * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
+ *
+ * (o) pseudo code
+ *
+ *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
+ *
+ *     if (dirty < thresh) scale up   pos_ratio
+ *     if (dirty > thresh) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
+ *
+ * (o) global/bdi control lines
+ *
+ * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
+ * several control lines in turn.
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * If any control line drops below Y=0 before reaching @limit, an auxiliary
+ * line will be set up to connect them. The figure below illustrates the main
+ * bdi control line with an auxiliary line extending it to @limit.
+ *
+ * This allows smoothly throttling bdi_dirty down to normal if it starts high
+ * in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
+ * - the bdi dirty thresh goes down quickly due to change of JBOD workload
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, bw scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, bw scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 bdi setpoint                 bdi origin            limit
+ *
+ * The bdi control line: if (origin < limit), an auxiliary control line (*)
+ * will be set up to extend the main control line (o) to @limit.
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long origin;
+	unsigned long goal;
+	unsigned long long span;
+	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 */
+	goal = thresh - thresh / DIRTY_SCOPE;
+	origin = 4 * thresh;
+
+	pos_ratio = 1ULL << BANDWIDTH_CALC_SHIFT;
+	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
+		goal = (goal + origin) / 2;	/* connect point, bw scale = 1/2 */
+		origin = limit;			/* auxiliary control line */
+		pos_ratio >>= 1;
+	}
+	pos_ratio *= origin - dirty;
+	do_div(pos_ratio, origin - goal + 1);
+
+	/*
+	 * bdi setpoint
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+	/*
+	 * Use span=(4*bw) in single disk case and transition to bdi_thresh in
+	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
+	 * Otherwise the bdi write bandwidth is good for limiting the floating
+	 * area, which makes the bdi control line a good backup when the global
+	 * control line is too flat/weak in large memory systems.
+	 */
+	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
+		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
+	do_div(span, thresh + 1);
+	origin = goal + 2 * span;
+
+	if (unlikely(bdi_dirty > goal + span)) {
+		if (bdi_dirty > limit)
+			return 0;
+		if (origin < limit) {
+			origin = limit;		/* auxiliary control line */
+			goal += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= origin - bdi_dirty;
+	do_div(pos_ratio, origin - goal + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
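
The control-line geometry described in the comment block can be tried out
in userspace with a small model. The sketch below is not the kernel
function: control_line() and CALC_SHIFT are hypothetical names, plain
64-bit division stands in for do_div(), and it models a single line
(balance point at goal, zero crossing at origin) with the auxiliary
extension to limit. It assumes goal < origin.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT	10	/* mirrors BANDWIDTH_CALC_SHIFT */

/*
 * Hypothetical model of one control line. Returns a fixed-point scale
 * factor: about 1 << CALC_SHIFT at dirty == goal, falling linearly to 0
 * at origin. If origin lies below limit, the line is bent at the midpoint
 * between goal and origin (the "connect point", scale = 1/2) and extended
 * so that it reaches 0 only at limit.
 */
static uint64_t control_line(uint64_t goal, uint64_t origin,
			     uint64_t limit, uint64_t dirty)
{
	uint64_t ratio = 1ULL << CALC_SHIFT;

	if (dirty >= limit)
		return 0;

	if (origin < limit && dirty > (goal + origin) / 2) {
		goal = (goal + origin) / 2;	/* connect point */
		origin = limit;			/* auxiliary line ends at limit */
		ratio >>= 1;			/* scale is 1/2 at connect point */
	}

	assert(dirty < origin);			/* holds whenever goal < origin */
	return ratio * (origin - dirty) / (origin - goal + 1);
}
```

For example, with goal = 1000, origin = 2000 and limit = 3000, the factor
is above 1024 below the goal, roughly halves at dirty = 1500, and then
follows the flatter auxiliary line down to 0 at the limit.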



^ permalink raw reply	[flat|nested] 203+ messages in thread

end of thread, other threads:[~2011-09-06 12:40 UTC | newest]

Thread overview: 203+ messages
2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
2011-08-16  2:20 ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16 19:41   ` Jan Kara
2011-08-16 19:41     ` Jan Kara
2011-08-17 13:23     ` Wu Fengguang
2011-08-17 13:49       ` Wu Fengguang
2011-08-17 13:49         ` Wu Fengguang
2011-08-17 20:24       ` Jan Kara
2011-08-17 20:24         ` Jan Kara
2011-08-18  4:18         ` Wu Fengguang
2011-08-18  4:18           ` Wu Fengguang
2011-08-18  4:41           ` Wu Fengguang
2011-08-18  4:41             ` Wu Fengguang
2011-08-18 19:16           ` Jan Kara
2011-08-18 19:16             ` Jan Kara
2011-08-24  3:16         ` Wu Fengguang
2011-08-24  3:16           ` Wu Fengguang
2011-08-19  2:53   ` Vivek Goyal
2011-08-19  2:53     ` Vivek Goyal
2011-08-19  3:25     ` Wu Fengguang
2011-08-19  3:25       ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 3/5] writeback: dirty rate control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 4/5] writeback: per task dirty rate limit Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  7:17   ` Andrea Righi
2011-08-16  7:17     ` Andrea Righi
2011-08-16  7:22     ` Wu Fengguang
2011-08-16  7:22       ` Wu Fengguang
2011-08-16  2:20 ` [PATCH 5/5] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-19  2:06   ` Vivek Goyal
2011-08-19  2:06     ` Vivek Goyal
2011-08-19  2:54     ` Wu Fengguang
2011-08-19  2:54       ` Wu Fengguang
2011-08-19 19:00       ` Vivek Goyal
2011-08-19 19:00         ` Vivek Goyal
2011-08-21  3:46         ` Wu Fengguang
2011-08-21  3:46           ` Wu Fengguang
2011-08-22 17:22           ` Vivek Goyal
2011-08-22 17:22             ` Vivek Goyal
2011-08-23  1:07             ` Wu Fengguang
2011-08-23  1:07               ` Wu Fengguang
2011-08-23  3:53               ` Wu Fengguang
2011-08-23  3:53                 ` Wu Fengguang
2011-08-23 13:53               ` Vivek Goyal
2011-08-23 13:53                 ` Vivek Goyal
2011-08-24  3:09                 ` Wu Fengguang
2011-08-24  3:09                   ` Wu Fengguang
     [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
2011-08-17  6:40 ` [PATCH 2/5] writeback: dirty position control David Horner
2011-08-17 12:03   ` Jan Kara
2011-08-17 12:35     ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
2011-08-06  8:44 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-08 13:46   ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 14:11     ` Wu Fengguang
2011-08-08 14:11       ` Wu Fengguang
2011-08-08 14:31       ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 22:47         ` Wu Fengguang
2011-08-08 22:47           ` Wu Fengguang
2011-08-09  9:31           ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-10 12:28             ` Wu Fengguang
2011-08-10 12:28               ` Wu Fengguang
2011-08-08 14:41       ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 23:05         ` Wu Fengguang
2011-08-08 23:05           ` Wu Fengguang
2011-08-09 10:32           ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 17:20           ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-10 22:34             ` Jan Kara
2011-08-10 22:34               ` Jan Kara
2011-08-11  2:29               ` Wu Fengguang
2011-08-11  2:29                 ` Wu Fengguang
2011-08-11 11:14                 ` Jan Kara
2011-08-11 11:14                   ` Jan Kara
2011-08-16  8:35                   ` Wu Fengguang
2011-08-16  8:35                     ` Wu Fengguang
2011-08-12 13:19             ` Wu Fengguang
2011-08-12 13:19               ` Wu Fengguang
2011-08-10 21:40           ` Vivek Goyal
2011-08-10 21:40             ` Vivek Goyal
2011-08-16  8:55             ` Wu Fengguang
2011-08-16  8:55               ` Wu Fengguang
2011-08-11 22:56           ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-12  2:43             ` Wu Fengguang
2011-08-12  2:43               ` Wu Fengguang
2011-08-12  3:18               ` Wu Fengguang
2011-08-12  5:45               ` Wu Fengguang
2011-08-12  5:45                 ` Wu Fengguang
2011-08-12  9:45                 ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12 11:07                   ` Wu Fengguang
2011-08-12 11:07                     ` Wu Fengguang
2011-08-12 12:17                     ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12  9:47               ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12 11:11                 ` Wu Fengguang
2011-08-12 11:11                   ` Wu Fengguang
2011-08-12 12:54           ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:59             ` Wu Fengguang
2011-08-12 12:59               ` Wu Fengguang
2011-08-12 13:08               ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:04           ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 14:20             ` Wu Fengguang
2011-08-12 14:20               ` Wu Fengguang
2011-08-22 15:38               ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-23  3:40                 ` Wu Fengguang
2011-08-23  3:40                   ` Wu Fengguang
2011-08-23 10:01                   ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 14:15                     ` Wu Fengguang
2011-08-23 14:15                       ` Wu Fengguang
2011-08-23 17:47                       ` Vivek Goyal
2011-08-23 17:47                         ` Vivek Goyal
2011-08-24  0:12                         ` Wu Fengguang
2011-08-24  0:12                           ` Wu Fengguang
2011-08-24 16:12                           ` Peter Zijlstra
2011-08-24 16:12                             ` Peter Zijlstra
2011-08-26  0:18                             ` Wu Fengguang
2011-08-26  0:18                               ` Wu Fengguang
2011-08-26  9:04                               ` Peter Zijlstra
2011-08-26  9:04                                 ` Peter Zijlstra
2011-08-26 10:04                                 ` Wu Fengguang
2011-08-26 10:04                                   ` Wu Fengguang
2011-08-26 10:42                                   ` Peter Zijlstra
2011-08-26 10:42                                     ` Peter Zijlstra
2011-08-26 10:52                                     ` Wu Fengguang
2011-08-26 10:52                                       ` Wu Fengguang
2011-08-26 11:26                                   ` Wu Fengguang
2011-08-26 12:11                                     ` Peter Zijlstra
2011-08-26 12:11                                       ` Peter Zijlstra
2011-08-26 12:20                                       ` Wu Fengguang
2011-08-26 12:20                                         ` Wu Fengguang
2011-08-26 13:13                                         ` Wu Fengguang
2011-08-26 13:18                                           ` Peter Zijlstra
2011-08-26 13:18                                             ` Peter Zijlstra
2011-08-26 13:24                                             ` Wu Fengguang
2011-08-26 13:24                                               ` Wu Fengguang
2011-08-24 18:00                           ` Vivek Goyal
2011-08-24 18:00                             ` Vivek Goyal
2011-08-25  3:19                             ` Wu Fengguang
2011-08-25  3:19                               ` Wu Fengguang
2011-08-25 22:20                               ` Vivek Goyal
2011-08-25 22:20                                 ` Vivek Goyal
2011-08-26  1:56                                 ` Wu Fengguang
2011-08-26  1:56                                   ` Wu Fengguang
2011-08-26  8:56                                   ` Peter Zijlstra
2011-08-26  8:56                                     ` Peter Zijlstra
2011-08-26  9:53                                     ` Wu Fengguang
2011-08-26  9:53                                       ` Wu Fengguang
2011-08-29 13:12                             ` Peter Zijlstra
2011-08-29 13:12                               ` Peter Zijlstra
2011-08-29 13:37                               ` Wu Fengguang
2011-08-29 13:37                                 ` Wu Fengguang
2011-09-02 12:16                                 ` Peter Zijlstra
2011-09-02 12:16                                   ` Peter Zijlstra
2011-09-06 12:40                                 ` Peter Zijlstra
2011-09-06 12:40                                   ` Peter Zijlstra
2011-08-24 15:57                       ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-25  5:30                         ` Wu Fengguang
2011-08-25  5:30                           ` Wu Fengguang
2011-08-23 14:36                     ` Vivek Goyal
2011-08-23 14:36                       ` Vivek Goyal
2011-08-09  2:08   ` Vivek Goyal
2011-08-09  2:08     ` Vivek Goyal
2011-08-16  8:59     ` Wu Fengguang
2011-08-16  8:59       ` Wu Fengguang
